Role Overview
We’re building advanced machine learning systems and need a DevOps engineer specializing in ML infrastructure to support our research and development. You will shape the backbone of our computational environment, ensuring that training and inference workflows run efficiently at scale.
Key Responsibilities
- Design, optimize, and maintain cloud and distributed compute infrastructure tailored for machine learning workloads
- Automate end-to-end ML pipelines including data processing, model training, validation, and deployment
- Ensure reproducibility and traceability across experiments and model versions
- Handle large-scale datasets, often in the terabyte range, with efficient storage and access patterns
- Implement and refine CI/CD practices specific to machine learning systems
- Monitor deployed models for performance degradation and data drift
- Partner closely with machine learning engineers, software developers, and platform teams to align infrastructure with research goals
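To give a concrete flavor of the drift-monitoring responsibility above, here is a minimal sketch of a Population Stability Index (PSI) check, a common way to flag data drift between a reference (training-time) sample and live traffic. This assumes NumPy; the function name and the PSI thresholds (0.1 / 0.25) are illustrative conventions, not part of this role's actual stack.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a live sample.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift (thresholds are conventional).
    """
    # Bin edges come from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values

    exp_frac = np.histogram(expected, edges)[0] / len(expected)
    act_frac = np.histogram(actual, edges)[0] / len(actual)

    # Small floor avoids log(0) / division by zero in empty bins.
    eps = 1e-6
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```

In practice a check like this would run on a schedule against production feature logs and page the on-call engineer when the score crosses the drift threshold.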
Required Qualifications
- Proven experience managing cloud environments and compute clusters
- Strong background in scaling infrastructure to support demanding ML workloads
- Proficiency in Linux system administration
- Hands-on experience supporting development, training, and production environments in the cloud
- Track record of automating workflows across multiple cloud providers
- Deep understanding of reliability, scalability, and automation throughout the machine learning lifecycle
Technology Environment
Our stack spans multiple cloud platforms, GPU-accelerated infrastructure, distributed computing frameworks, ML-focused CI/CD tooling, data ingestion pipelines, and model monitoring systems. The role requires fluency in Linux environments and a strong systems mindset.
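As an illustration of the reproducibility and traceability work described above, the sketch below fingerprints an experiment run by hashing its configuration and input data, plus capturing the runtime environment. It uses only the Python standard library; the function name and metadata fields are hypothetical examples, not a prescribed interface.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def run_fingerprint(config, data_files=None):
    """Capture enough metadata to trace an experiment back to its inputs."""
    # Canonical JSON (sorted keys) so the same config always hashes identically.
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()

    # Hash input files in a stable (sorted) order, streaming in 1 MiB chunks.
    data_hash = hashlib.sha256()
    for path in sorted(data_files or []):
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                data_hash.update(chunk)

    return {
        "config_sha256": config_hash,
        "data_sha256": data_hash.hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

A record like this, logged alongside each training run, lets an engineer later answer "which exact config and data produced this model?" without guesswork.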