As a Senior Site Reliability Engineer, you will shape the future of system reliability and observability across a global cloud infrastructure. Your work will center on building and refining scalable solutions that ensure high availability, performance, and proactive incident response across distributed services.
Key Responsibilities
- Design, write, and deploy code that enhances system reliability, setting benchmarks for quality and maintainability.
- Review code and production changes, providing actionable feedback that strengthens system integrity.
- Lead root cause analysis and resolution of complex architectural and operational issues.
- Develop and manage AI-powered monitoring systems using Python and observability platforms to detect and prevent outages.
- Use Terraform and Infrastructure as Code practices to automate monitoring coverage and issue detection across environments.
- Operate and optimize workloads on Google Cloud Platform, ensuring efficient scaling, cost control, and performance.
- Partner with development teams to embed reliability into the software lifecycle using platform engineering principles.
- Create automated solutions for monitoring, tuning, and disaster recovery, leveraging AI to reduce toil.
- Diagnose and resolve issues across development, testing, and production environments.
- Conduct post-incident reviews and implement safeguards to prevent recurrence.
- Enforce security standards across infrastructure, supporting audits and compliance efforts.
- Contribute to capacity planning by analyzing trends and advising on resource needs.
- Optimize system performance through profiling, tuning, and metric monitoring.
- Maintain and test disaster recovery protocols to ensure business continuity.
- Document system designs, analyses, and procedures, and help improve team-wide design practices.
Qualifications
A bachelor’s degree in Computer Science, Engineering, or Mathematics, or equivalent experience, is required. You should have at least three years of experience in roles such as SRE, DevOps, or Software Engineering.
Essential skills include proficiency with observability tools, experience with cloud platforms (especially GCP) and Kubernetes, strong Python programming, experience with both relational and document databases, and the ability to debug, optimize, and automate systems. You must also demonstrate strong problem-solving abilities and effective communication in high-pressure settings.
Preferred qualifications include experience with agentic AI and MCP development, Terraform Provider development, and Dynatrace SaaS.
Technology Environment
Python, Terraform, Google Cloud Platform (GCP), Kubernetes, Dynatrace SaaS, AI-driven observability, Infrastructure as Code (IaC), relational and document databases.
Work Environment
This position is remote and supports a global team structure. You’ll have the flexibility to work from any location while contributing to a large-scale, distributed engineering ecosystem.


