NVIDIA is hiring a Senior Site Reliability Engineer, DGX Cloud

Responsibilities

  • Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services before they launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in on-call rotation to support production services

Required Skills

Kubernetes, AWS, GCP, Microsoft Azure, Terraform, Ansible, Puppet, Python, Distributed Systems, Infrastructure as Code, Cloud Computing, Automation

About company
NVIDIA
NVIDIA builds accelerated computing platforms and AI technologies that power advancements in areas such as generative AI, data centers, robotics, and digital twins.
Job Details
Category: infrastructure
Posted: 10 months ago