NVIDIA is hiring a Senior Site Reliability Engineer, DGX Cloud

About the Role

NVIDIA is looking for a Senior Site Reliability Engineer to join our DGX Cloud team. You will be responsible for maintaining the high-performance DGX Cloud clusters used by AI researchers and enterprise clients worldwide, focusing on the operational and reliability aspects of large-scale Kubernetes environments.

What You'll Do

  • Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in the on-call rotation to support production services

What We're Looking For

  • BS in Computer Science or related technical field, or equivalent experience
  • 10+ years of experience operating production services
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
  • Proficiency in at least one high-level programming language (e.g., Python, Go)
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
  • Solid working knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools such as OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, and Splunk

Nice to Have

  • Operating GPU-accelerated clusters with KubeVirt in production
  • Applying generative-AI techniques to reduce operational toil
  • Automating incident response with Shoreline or StackStorm

Technical Stack

  • Cloud: AWS, GCP, Azure, OCI
  • Infrastructure & Orchestration: Kubernetes, Terraform, Ansible, Chef, Puppet, KubeVirt
  • Languages: Python, Go
  • Platform: Linux
  • Observability: OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk
  • Incident Automation: Shoreline, StackStorm

Team & Environment

You will be part of a diverse, supportive environment where everyone is inspired to do their best work.

Required Skills

Kubernetes, AWS, GCP, Azure, OCI, Terraform, Ansible, Chef, Puppet, Python, Site Reliability Engineering, Distributed Systems, Infrastructure as Code, Cloud Computing, Automation

About the Company

NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.

Job Details

  • Category: Infrastructure
  • Posted: 8 months ago