Responsibilities
Build & Scale Core Infrastructure
- Design and implement backend systems that support large-scale ML workloads, including fine-tuning and reinforcement learning.
- Build distributed training and inference pipelines that are efficient, fault-tolerant, and observable.
- Develop internal developer tools and platforms that make it easier for ML engineers to train, evaluate, and deploy models.
Cloud & Systems Engineering
- Work on cloud-native systems using containers and orchestration (e.g., Kubernetes).
- Optimize systems for performance, reliability, and cost efficiency, especially for GPU-heavy workloads.
- Implement monitoring, logging, and observability for long-running training jobs and production services.
Collaborate with ML Engineers
- Partner closely with ML engineers to support evolving model architectures, training workflows, and evaluation needs.
- Translate ML requirements into scalable backend and infrastructure solutions.
Requirements
- 1–3 years of backend engineering experience, ideally working on production systems.
- Strong fundamentals in distributed systems, networking, and backend architecture.
- Experience building systems that scale under real load.
- Comfortable working in Python and/or Go (or similar backend languages).
- Excited to work on-site in San Francisco with a fast-moving early-stage team.
Nice to Have
- Experience with or exposure to ML infrastructure or ML platforms.
- Familiarity with GPU workloads, training pipelines, or inference systems.
- Experience with containerization and orchestration (Docker, Kubernetes).
- Contributions to or deep familiarity with ML infrastructure libraries such as Ray, vLLM, SGLang, or similar distributed ML systems.
Work Arrangement
On-site — San Francisco