About the Role

This role combines software engineering and operations to build and maintain reliable, scalable systems. Responsibilities include incident management, automation, monitoring, and performance improvement.

Responsibilities

Design and implement monitoring solutions for system health and performance
Respond to and resolve production incidents promptly
Develop automation tools to reduce manual operational tasks
Collaborate with development teams to improve system reliability
Participate in on-call rotations for critical system support
Analyze system failures and implement preventive measures
Optimize system performance and scalability
Maintain and improve deployment pipelines
Ensure high availability of services and infrastructure
Troubleshoot complex issues in distributed systems
Support incident post-mortem processes with actionable recommendations
Implement and manage configuration management tools
Work on capacity planning and resource forecasting
Enforce security and compliance standards in production systems
Contribute to disaster recovery planning and execution
Develop and maintain technical documentation
Drive adoption of best practices in reliability engineering
Evaluate and integrate new technologies for operational efficiency
Monitor and report on service level objectives and error budgets
Collaborate on system architecture improvements
Support cloud infrastructure management and optimization
Promote a culture of blameless post-mortems and continuous learning
Assist in code reviews with a focus on operational impact
Ensure systems meet defined reliability and uptime targets
Participate in system design reviews for new features

Compensation

Competitive salary and benefits package

Work Arrangement

Hybrid work model with flexibility for remote and office-based work

Team

Collaborative engineering team focused on scalable systems and operational excellence

Why This Role Matters

This position plays a critical role in maintaining the stability and performance of large-scale software systems. The engineer ensures that services remain available and responsive under varying loads and helps bridge the gap between development and operations.

Technology Stack

The team uses modern cloud infrastructure, Kubernetes for orchestration, Prometheus and Grafana for monitoring, GitLab for CI/CD, and a microservices-based architecture built with Java and Go.

Growth Opportunities

Engineers are encouraged to lead initiatives, mentor peers, and contribute to cross-team projects. There are clear pathways for technical and leadership advancement.

Visa sponsorship available for qualified candidates

Guidewire is hiring a Site Reliability Engineer

