Shape the foundation of data intelligence in biomedicine by leading the design and implementation of robust, scalable data systems. As a Senior Data Engineer, you will own the architecture of end-to-end pipelines that transform fragmented, heterogeneous scientific data—spanning genomics, chemical assays, clinical findings, and more—into structured, reliable inputs for machine learning and research.
What You’ll Do
- Design and evolve schema-first data models that unify noisy, semi-structured sources into coherent, versioned, and interoperable datasets.
- Build and maintain cloud-native infrastructure across storage, processing, and streaming layers, prioritizing scalability, correctness, and operational resilience.
- Develop pipelines supporting both batch and real-time data access, tailored to the needs of ML training, evaluation, and inference workflows.
- Define and enforce standards for data quality, lineage, validation, and provenance to support scientific reproducibility.
- Collaborate closely with ML engineers, scientists, and product leads to translate research questions into durable, reusable data abstractions.
- Provide technical leadership through architecture reviews, mentorship, and cross-functional guidance, balancing performance, cost, and long-term maintainability.
- Proactively identify and mitigate systemic risks in data integrity, scalability bottlenecks, and operational complexity.
What We’re Looking For
- 5+ years of experience building and operating production data systems, with clear ownership of platform-level decisions.
- Strong command of Python and modern data engineering practices, with experience in distributed processing frameworks such as Spark, Beam, or Ray.
- Deep familiarity with cloud platforms (AWS, GCP, or Azure), including storage, compute, networking, and security primitives.
- Proven track record of designing large-scale pipelines that support ML workloads, from feature generation to reproducible training environments.
- Experience with orchestration tools like Airflow or Dagster, and streaming platforms such as Kafka, Pub/Sub, or Kinesis.
- Ability to make sound architectural trade-offs and communicate them effectively across technical and scientific domains.
Nice to Have
- Background working with biomedical or life science data—including omics, molecular representations, toxicology, or clinical datasets.
- Experience with ontology-driven modeling or schema evolution in scientific contexts.
- Proficiency in infrastructure-as-code tools like Terraform and in container and orchestration systems such as Docker and Kubernetes.
- Experience in fast-moving, research-intensive, or early-stage environments.
- Contributions to open-source projects or technical publications.
Environment & Culture
This role offers substantial autonomy and ownership in a technically deep, low-ego environment. You’ll work in a culture centered on learning, precision, and long-term thinking. The team supports flexible remote or hybrid arrangements, with a focus on sustainable, impactful engineering. You’ll help build the data backbone of an AI-powered biomedical platform, directly influencing how raw scientific data becomes machine-understandable knowledge.
Compensation includes competitive pay, meaningful equity participation, and a benefits package designed to support long-term contributors.
