We're looking for a Senior Site Reliability Engineer to help sustain and evolve the infrastructure behind one of the world's most visited platforms. You'll play a key role in ensuring reliability, performance, and scalability across a vast, distributed system used by millions daily. This position is central to maintaining operational excellence while advancing automation, observability, and resilience.
What You’ll Do
- Manage and optimize production systems through deployment, configuration, and ongoing maintenance using modern DevOps practices
- Design and implement automation for provisioning, scaling, and monitoring services using tools like Puppet and Kubernetes
- Collaborate with engineering teams to shape scalable architectures and guide best practices in system design
- Respond to incidents as part of a rotating on-call schedule, leading diagnosis, resolution, and post-mortem analysis to strengthen system resilience
- Diagnose complex issues across layers—from network protocols to application performance—using deep knowledge of TCP/IP, HTTP, TLS, and DNS
- Contribute to a culture of continuous improvement by identifying inefficiencies and driving automation initiatives
- Mentor team members and share expertise across a globally distributed, asynchronous work environment
- Travel 1–2 times per year for team gatherings and in-person collaboration
What We’re Looking For
- At least six years of experience in site reliability, systems engineering, or DevOps roles within large-scale environments
- Strong scripting ability in Python, Bash, or similar, with hands-on experience in configuration management (especially Puppet)
- Proven skill in Linux system administration, particularly on Debian-based systems, including package management and kernel-level troubleshooting
- Deep understanding of distributed systems, caching architectures, and performance optimization
- Experience with incident response, root cause analysis, and implementing preventive measures
- Excellent written and verbal communication skills in English, with the ability to work independently across time zones
Nice to Have
- Background in tuning Linux kernels for high-throughput services
- Familiarity with caching proxies such as Varnish, Nginx, or Envoy
- Experience with monitoring and alerting stacks like Prometheus and Grafana
- Contributions to open-source projects or active participation in developer communities
- Knowledge of PHP, HHVM, Redis, or MediaWiki ecosystems
- Experience defining and managing service-level objectives (SLOs) across teams
Our Environment
We operate as a remote-first organization with team members across more than 40 countries. All code, configuration, and documentation are publicly accessible, reflecting our commitment to open-source principles. Our culture values diversity, transparency, and continuous learning. We prioritize equitable compensation, inclusive hiring, and accessibility for all applicants and employees.


