Build & Scale Core Infrastructure
Design and implement backend systems that support large-scale ML workloads, including fine-tuning and reinforcement learning.
Build distributed training and inference pipelines that are efficient, fault-tolerant, and observable.
Develop internal developer tools and platforms that make it easier for ML engineers to train, evaluate, and deploy models.
Cloud & Systems Engineering
Work on cloud-native systems using containers and orchestration (e.g., Kubernetes).
Optimize systems for performance, reliability, and cost efficiency, especially for GPU-heavy workloads.
Implement monitoring, logging, and observability for long-running training jobs and production services.
Collaborate with ML Engineers
Partner closely with ML engineers to support evolving model architectures, training workflows, and evaluation needs.
Translate ML requirements into scalable backend and infrastructure solutions.

1–3 years of backend engineering experience, ideally working on production systems.
Strong fundamentals in distributed systems, networking, and backend architecture.
Experience building systems that scale under real load.
Comfortable working in Python and/or Go (or similar backend languages).
Excited to work on-site in San Francisco with a fast-moving early-stage team.

Experience with or exposure to ML infrastructure or ML platforms.
Familiarity with GPU workloads, training pipelines, or inference systems.
Experience with containerization and orchestration (Docker, Kubernetes).
Contributions to or deep familiarity with ML infrastructure libraries such as Ray, vLLM, SGLang, or similar distributed ML systems.

On-site — San Francisco

Rockstar is hiring a Backend Software Engineer (ML Infra)