About the Role
This role combines software engineering and operations to build and maintain reliable, scalable systems. Responsibilities include incident management, automation, monitoring, and system performance improvement.
Responsibilities
- Design and implement monitoring solutions for system health and performance
- Respond to and resolve production incidents promptly
- Develop automation tools to reduce manual operational tasks
- Collaborate with development teams to improve system reliability
- Participate in on-call rotations for critical system support
- Analyze system failures and implement preventive measures
- Optimize system performance and scalability
- Maintain and improve deployment pipelines
- Ensure high availability of services and infrastructure
- Troubleshoot complex issues in distributed systems
- Support incident post-mortem processes with actionable recommendations
- Implement and manage configuration management tools
- Work on capacity planning and resource forecasting
- Enforce security and compliance standards in production systems
- Contribute to disaster recovery planning and execution
- Develop and maintain technical documentation
- Drive adoption of best practices in reliability engineering
- Evaluate and integrate new technologies for operational efficiency
- Monitor and report on service level objectives and error budgets
- Collaborate on system architecture improvements
- Support cloud infrastructure management and optimization
- Promote a culture of blameless post-mortems and continuous learning
- Assist in code reviews with a focus on operational impact
- Ensure systems meet defined reliability and uptime targets
- Participate in system design reviews for new features
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexibility for remote and office-based work
Team
Collaborative engineering team focused on scalable systems and operational excellence
Why This Role Matters
This position plays a critical role in maintaining the stability and performance of large-scale software systems. The engineer ensures that services remain available and responsive under varying loads and helps bridge the gap between development and operations.
Technology Stack
The team uses modern cloud infrastructure, Kubernetes for orchestration, Prometheus and Grafana for monitoring, GitLab for CI/CD, and a microservices-based architecture built with Java and Go.
Growth Opportunities
Engineers are encouraged to lead initiatives, mentor peers, and contribute to cross-team projects. There are clear pathways for technical and leadership advancement.
Visa Sponsorship
Available for qualified candidates


