North America

UnitedHealth Group / Optum is hiring a Lead Site Reliability Engineer

About the Role

UnitedHealth Group / Optum is looking for a Lead Site Reliability Engineer to operate at the intersection of development and operations. You will ensure systems meet stringent production SLAs while empowering development teams to ship code faster and more safely. Our mission is to help people live healthier lives and make the health system work better for everyone, supported by a culture guided by inclusion, talented peers, and comprehensive benefits.

What You'll Do

  • Champion the philosophy of 'automation first' to eliminate manual, repetitive operational tasks (toil).
  • Design and implement robust automation solutions to allow engineers to focus on strategic projects.
  • Implement and manage comprehensive monitoring, logging, and alerting systems to provide deep visibility into application performance and infrastructure health.
  • Develop dashboards and tools that enable rapid detection and resolution of incidents.
  • Act as a catalyst for DevOps culture and practices across development teams.
  • Provide the tools, infrastructure, and guardrails necessary to accelerate the software delivery lifecycle securely and reliably.
  • Lead the design and implementation of automated operational workflows for existing services and new service onboarding, including provisioning, deployment, scaling, and self-healing capabilities.
  • Oversee incident response management, lead root cause analyses (post-mortems), and ensure action items are completed to prevent recurrence.
  • Manage and optimize cloud infrastructure costs and efficiency using Infrastructure as Code (IaC) principles.

What We're Looking For

  • Undergraduate degree or equivalent experience.
  • 10+ years of experience in SRE, DevOps, Software Engineering, or a related operational capacity within a high-traffic production environment.
  • Extensive experience in managing critical production systems, incident response, and leading post-mortem processes.
  • Proven experience managing infrastructure and applications within a major public cloud environment (AWS, Azure, or GCP) at scale.
  • Proven track record of automating complex, manual operational processes and improving engineering efficiency.
  • Hands-on experience implementing and managing monitoring and logging stacks (e.g., Prometheus, Grafana, ELK stack/Elasticsearch, Datadog, Splunk).
  • Solid experience with Infrastructure as Code tools such as Terraform or CloudFormation, and configuration management tools (Ansible, Chef, or Puppet).
  • Proficiency in programming languages (e.g., Python, Go, Ruby, or Java/C#) used for automation, tooling development, and services management.
  • Proven expertise in cloud platforms (e.g., AWS services such as EC2, S3, RDS, Lambda, EKS/ECS).
  • Proven expertise in Docker and Kubernetes for container orchestration and management (mandatory).
  • Proven expertise in building and maintaining robust CI/CD pipelines (e.g., GitHub Actions, Jenkins, GitLab CI, Azure DevOps) and strong Git practices.

Technical Stack

  • Monitoring & Logging: Prometheus, Grafana, ELK stack/Elasticsearch, Datadog, Splunk
  • Infrastructure as Code: Terraform, CloudFormation
  • Configuration Management: Ansible, Chef, Puppet
  • Languages: Python, Go, Ruby, Java, C#
  • AWS Services: EC2, S3, RDS, Lambda, EKS/ECS
  • Containers & Orchestration: Docker, Kubernetes
  • CI/CD: GitHub Actions, Jenkins, GitLab CI, Azure DevOps
  • Version Control: Git

UnitedHealth Group is committed to mitigating environmental impact and enabling and delivering equitable care that addresses health disparities.

Required Skills
Prometheus, Grafana, ELK stack, Terraform, AWS, Azure, GCP, Ansible, Chef, Puppet, incident response, post-mortem, automation
About company
UnitedHealth Group / Optum
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. It connects people with care, pharmacy benefits, data, and resources.
Job Details
Department: Information Technology
Category: Infrastructure
Posted: 2 months ago