Senior Data Engineer (Data + Applied AI)
What You'll Do
Design and manage end-to-end data pipelines in cloud data warehouses, ensuring reliability, scalability, and compliance with healthcare regulations. You'll implement data transformation workflows using dbt across multiple layers, enforce data quality through automated testing, and maintain clear documentation and lineage tracking.
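dbt expresses quality checks declaratively in YAML, but the underlying logic is simple. A minimal sketch of `not_null`- and `unique`-style tests in plain Python (the table and column names are illustrative, not from our schema):

```python
# Data-quality checks analogous to dbt's built-in `not_null` and `unique`
# generic tests. Record and column names here are illustrative only.

def check_not_null(rows, column):
    """Return rows where the given column is missing (empty list = pass)."""
    return [r for r in rows if r.get(column) is None]

def check_unique(rows, column):
    """Return values that appear more than once in the column."""
    seen, dupes = set(), set()
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes)

# A failing example: one null MRN and one duplicated patient_id.
patients = [
    {"patient_id": 1, "mrn": "A100"},
    {"patient_id": 2, "mrn": "A101"},
    {"patient_id": 2, "mrn": None},
]

null_failures = check_not_null(patients, "mrn")
dupe_failures = check_unique(patients, "patient_id")
```

In dbt these same assertions live next to the model definition and run on every build, which is what keeps the layered models trustworthy.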
Develop and optimize Airflow DAGs to orchestrate complex data workflows, including scheduling, error recovery, and alerting. Build dimensional models and data marts that support both business intelligence and machine learning use cases, following established modeling standards.
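Airflow configures this behavior declaratively on a task (`retries`, `retry_delay`, and an `on_failure_callback` for alerting). A framework-agnostic sketch of the same error-recovery-and-alert pattern, with a hypothetical `alert` hook standing in for a pager or Slack notification:

```python
import time

def run_with_retries(task, retries=3, delay=0.01, alert=print):
    """Run `task`, retrying on failure; fire `alert` if every attempt fails.
    Mirrors the retry/alerting behavior Airflow sets up declaratively."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                alert(f"task failed after {retries} attempts: {exc}")
                raise
            time.sleep(delay)  # simple fixed backoff between attempts

calls = {"n": 0}

def flaky_extract():
    # Fails twice, then succeeds -- stands in for a transient source outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return "rows loaded"

result = run_with_retries(flaky_extract)
```

The point of the sketch is the separation of concerns: the task only knows how to do its work, while scheduling, recovery, and alerting live in the orchestration layer.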
Integrate data from diverse sources—including electronic health records, payment systems, and third-party APIs—into a unified platform. Apply strict data handling protocols for PHI and PII, implementing masking, tokenization, and access controls across all systems.
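In practice tokenization is handled by dedicated tooling and managed keys, but the core idea fits in a few lines: deterministically replace an identifier with a keyed hash (HMAC) so the raw value never flows downstream. A minimal sketch; the key and field names are illustrative only:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative; use a managed secret in practice

def tokenize(value, key=SECRET_KEY):
    """Deterministically replace an identifier with a keyed hash token."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_record(record, phi_fields=("mrn", "ssn")):
    """Return a copy of the record with PHI fields tokenized."""
    return {k: tokenize(v) if k in phi_fields and v else v
            for k, v in record.items()}

raw = {"mrn": "A100", "ssn": "123-45-6789", "visit_count": 4}
safe = mask_record(raw)
```

Because the same input always yields the same token, joins across systems still work on the tokenized values, while anyone without the key cannot recover the original identifier.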
Architect and deploy retrieval-augmented generation (RAG) pipelines using frameworks like LangChain or LangGraph, covering document processing, embedding generation, and semantic retrieval. Support MLOps practices by maintaining model training pipelines, monitoring performance, and enabling retraining workflows.
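Frameworks like LangChain wrap the retrieval step, but stripped of the framework it reduces to nearest-neighbor search over embeddings. A toy sketch with hand-written vectors (a real pipeline would use a learned embedding model and a vector store, not three-dimensional literals):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings": in practice these come from an embedding model.
docs = {
    "billing policy": [0.9, 0.1, 0.0],
    "clinical note": [0.1, 0.9, 0.2],
    "appointment faq": [0.2, 0.2, 0.9],
}

def retrieve(query_vec, store, k=1):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, store[d]),
                    reverse=True)
    return ranked[:k]

# A query embedded near the "billing policy" document.
top = retrieve([0.85, 0.15, 0.05], docs, k=1)
```

Everything else in a RAG pipeline (chunking, embedding generation, prompt assembly) is scaffolding around this retrieval core, which is why retrieval quality is usually the first thing to measure.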
Collaborate with product managers, analysts, and clinical stakeholders to deliver actionable dashboards in Looker. Participate in peer code review, contribute to engineering standards, and troubleshoot pipeline failures. Document technical designs and evaluate emerging tools through prototyping and hands-on testing.
What We're Looking For
- 5+ years of experience in data engineering or analytics engineering roles
- 2+ years working with healthcare data, including familiarity with clinical workflows and regulatory environments
- Proven work with HIPAA-compliant systems, including data classification and access governance
- Hands-on expertise with cloud data warehouses (BigQuery, Snowflake, or Redshift) and advanced SQL optimization
- Production experience with dbt, including model layering, testing, and documentation
- Deep knowledge of Apache Airflow for workflow orchestration and monitoring
- Experience building star or snowflake schemas and managing slowly changing dimensions
- Experience delivering reports and dashboards using enterprise BI tools such as Looker or Power BI
- Python proficiency for pipeline development and API integrations (Pandas, PySpark)
- Practical experience with RAG pipelines and LLM integration frameworks
- Understanding of MLOps lifecycle components, including deployment and monitoring
- Experience with CI/CD systems for data workflows (e.g., GitHub Actions, dbt Cloud CI)
- Familiarity with data governance tools such as OpenMetadata, and with principles like data contracts and lineage tracking
- Strong communication skills and ability to work independently while aligning with team goals
Nice to Have
- Experience with streaming data platforms like Kafka, Kinesis, or Pub/Sub, especially for clinical event data
- Knowledge of vector databases including Pinecone, Weaviate, FAISS, or Chroma
- Understanding of responsible AI practices in healthcare, such as bias evaluation and explainability
- Exposure to data observability platforms like Monte Carlo, Bigeye, or Soda
- Familiarity with data lakehouse architectures (Delta Lake, Iceberg, Hudi)
- Experience supporting SOC2 or HITRUST compliance efforts
- Working knowledge of semantic modeling tools such as Looker’s LookML or dbt Semantic Layer
- Background with population health, revenue cycle, or clinical quality metrics
- Experience deploying ML workloads using Kubernetes or containerized environments
Technology Environment
Our stack includes Google BigQuery, dbt (Core or Cloud), Airflow, Looker, Python, Pandas, PySpark, LangChain, LangGraph, LlamaIndex, GitHub Actions, OpenMetadata, Kafka, Kinesis, Pub/Sub, Pinecone, Weaviate, FAISS, Chroma, Monte Carlo, Bigeye, Soda, Delta Lake, Iceberg, Apache Hudi, and Kubernetes.
