Oversee the creation and management of scalable, production-ready cloud infrastructure for machine learning tasks on AWS and GCP, incorporating GPU and TPU resources for training and inference.
Design and enhance continuous integration and deployment pipelines for machine learning models and platform services to ensure rapid, dependable, and secure production deployments.
Manage and advance low-latency infrastructure for real-time model inference, utilizing key-value stores and vector databases.
Establish and uphold monitoring standards for machine learning systems, covering model performance tracking, drift identification, capacity forecasting, and pipeline status metrics.
Engage in on-call duties, spearheading incident management and root-cause investigations for critical machine learning training and serving infrastructure.
Collaborate with data scientists and machine learning engineers to enhance platform usability, speed up model development cycles, and apply robust MLOps and SRE methodologies.
Promote operational excellence in machine learning infrastructure via automation, resilience engineering, disaster recovery strategies, and ongoing enhancements.

Bachelor's or Master's degree in Computer Science, Engineering, or a related quantitative discipline.
Minimum of 8 years of experience in DevOps, SRE, or machine learning infrastructure roles, including at least 4 years working on large-scale ML or AI systems.
Strong programming skills in Python, Scala, or Java for developing automation tools and platform utilities.
Extensive knowledge of Kubernetes and container orchestration on Google Cloud Platform (GKE) and/or Amazon Web Services (EKS).
Expertise with NoSQL databases or low-latency data stores such as Aerospike or comparable technologies.
Practical experience with data processing and orchestration tools such as Apache Spark, Apache Flink, Apache Airflow, and Kafka.
Background in constructing and maintaining CI/CD systems using Jenkins, GitLab Runner, or similar tools.
Familiarity with feature engineering platforms such as Chronon and model lifecycle management tools like MLflow.
Strong proficiency in infrastructure-as-code practices using Terraform or equivalent technologies.
Experience with observability platforms including Prometheus, Grafana, and Datadog.
Outstanding communication skills and ability to work effectively across diverse teams.

Health insurance coverage.
Equity awards program.
Life insurance policy.
Disability benefits.
Parental leave provisions.
Wellness program benefits.
Paid time off allowances.
Worldwide access to mental health and financial wellness resources and support.
Local benefits encompassing statutory and optional offerings such as healthcare (medical, dental, vision), life, accident, disability, commuter, and retirement plans (401(k)/pension).
Support for taking time off in line with local leave policies, with flexibility for changing work-life needs.

Hybrid

The company encourages an inclusive and cooperative workplace. Teams typically work onsite Monday through Thursday, with remote flexibility on Fridays, except for roles required to be in the office full-time or locations with a five-day policy.
The organization welcomes candidates from all backgrounds and offers reasonable accommodations in accordance with applicable laws; inquiries during the hiring process should be directed to EmployeeRelations@Roku.com.

Roku is hiring a Senior Machine Learning Engineer, DevOps/SRE