Remote (Global)

Endgame Systems, LLC is hiring a Senior Site Reliability Engineer (Resilience) - Platform Resilience

About the Role

As a Senior Site Reliability Engineer (Resilience), you'll drive the evolution of large-scale, multi-cloud systems by designing automation and improving platform reliability. You'll work across engineering teams to build resilient infrastructure for cloud-hosted and serverless services, ensuring systems scale efficiently and operate consistently under real-world demands.

What You'll Do

  • Design and implement automation to streamline system engineering workflows and strengthen platform stability.
  • Scale global infrastructure to meet growing demand through code-driven tooling and maintainable systems.
  • Lead incident response and problem management efforts to reduce recurring customer impact and improve resolution efficiency.
  • Collaborate across time zones in a follow-the-sun on-call model, with on-call shifts falling primarily within regular working hours.
  • Foster a culture of shared ownership, operational rigor, and continuous learning within engineering teams.

Requirements

  • Demonstrated experience applying engineering principles to improve platform reliability and reduce operational toil.
  • Customer-focused mindset with the ability to assess and resolve operational issues through an SRE lens.
  • Strong software engineering background enabling effective collaboration on system design and implementation.

Preferred Qualifications

  • Hands-on experience with public cloud platforms and managed Kubernetes services.
  • Proven work with Infrastructure-as-Code tools such as Crossplane or Terraform in SaaS environments.
  • Experience operating large-scale Kubernetes clusters across multiple cloud providers.
  • Proficiency in Golang or other programming languages for building system-level tools.
  • Familiarity with container technologies like Docker and distributed Linux environments.
  • Track record of improving alerting, incident response, and observability systems using tools like Prometheus, Influx, Graphite, or the Elastic Stack.
  • Experience mentoring engineers and promoting knowledge sharing in distributed teams.
  • Background in inclusive communication practices that strengthen team and partner relationships.
  • Remote work experience in self-directed, globally distributed teams.

Benefits

  • Compensation aligned with role impact, not prior salary history.
  • Comprehensive health coverage for employees and dependents in many regions.
  • Flexible work arrangements with support for remote and asynchronous collaboration.
  • Generous annual vacation allowance.
  • Up to $2,000 in matched donations for charitable giving or community service.
  • 40 hours annually dedicated to volunteer activities.
  • Minimum of 16 weeks of parental leave.
  • Commitment to diversity, equity, and inclusion across a global workforce.
  • Clear pathways for professional development regardless of age, background, or tenure.

Required Skills

Kubernetes, Terraform, Docker, Prometheus, Elastic Stack, Golang, Linux, Crossplane, Influx, Graphite, Site Reliability Engineering, Platform Resilience, SRE, Cloud Infrastructure, Observability

About company
Endgame Systems, LLC
Endgame Systems, LLC provides consulting services related to Elastic technology to Government agencies with heightened security needs. It is a wholly-owned subsidiary of Elastic, focused on Government services.
Job Details
Department: Platform Engineering
Category: Infrastructure
Posted: 17 days ago