Responsibilities
- Oversee the creation and management of scalable, production-ready cloud infrastructure for machine learning tasks on AWS and GCP, incorporating GPU and TPU resources for training and inference.
- Design and enhance continuous integration and deployment pipelines for machine learning models and platform services to ensure rapid, dependable, and secure production deployments.
- Manage and advance low-latency infrastructure for real-time model inference, utilizing key-value stores and vector databases.
- Establish and uphold monitoring standards for machine learning systems, covering model performance tracking, drift identification, capacity forecasting, and pipeline status metrics.
- Engage in on-call duties, spearheading incident management and root-cause investigations for critical machine learning training and serving infrastructure.
- Collaborate with data scientists and machine learning engineers to enhance platform usability, speed up model development cycles, and apply robust MLOps and SRE methodologies.
- Promote operational excellence in machine learning infrastructure via automation, resilience engineering, disaster recovery strategies, and ongoing enhancements.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related quantitative discipline.
- Minimum of 8 years of experience in DevOps, SRE, or machine learning infrastructure roles, including at least 4 years dedicated to large-scale ML or AI systems.
- Strong programming skills in Python, Scala, or Java for developing automation tools and platform utilities.
- Extensive knowledge of Kubernetes and container orchestration on Google Cloud Platform (GKE) and/or Amazon Web Services (EKS).
- Expertise with NoSQL databases or low-latency data stores such as Aerospike or comparable technologies.
- Practical experience with data processing and orchestration tools such as Apache Spark, Apache Flink, Apache Airflow, and Kafka.
- Background in constructing and maintaining CI/CD systems using Jenkins, GitLab Runner, or similar tools.
- Familiarity with feature engineering platforms such as Chronon and model lifecycle management tools like MLflow.
- Strong proficiency in infrastructure-as-code practices using Terraform or equivalent technologies.
- Experience with observability platforms including Prometheus, Grafana, and Datadog.
- Outstanding communication skills and ability to work effectively across diverse teams.
Nice to Have
- Background in the advertising industry.
Benefits
- Health insurance coverage.
- Equity awards program.
- Life insurance policy.
- Disability benefits.
- Parental leave provisions.
- Wellness program benefits.
- Paid time off allowances.
- Worldwide access to mental health and financial wellness resources and support.
- Local benefits encompassing statutory and optional offerings such as healthcare (medical, dental, vision), life, accident, disability, commuter, and retirement plans (401(k)/pension).
- Support for taking time off, in line with local leave policies and personal needs, to accommodate evolving work-life demands.
Work Arrangement
Hybrid
Other
- The company encourages an inclusive and cooperative workplace. Teams typically work onsite Monday through Thursday, with remote flexibility on Fridays, except for roles required to be in-office full-time or locations with a five-day policy.
- The organization welcomes candidates from all backgrounds and provides reasonable accommodations in accordance with applicable laws. Direct inquiries to EmployeeRelations@Roku.com during the hiring process.