Join a globally distributed engineering team as a Senior Site Reliability Engineer, focused on maintaining and evolving a high-performance cloud platform across the APAC region. You will play a central role in ensuring system reliability, operational efficiency, and seamless scalability through automation, proactive monitoring, and strategic planning.
What You’ll Do
- Lead the ongoing operation and tuning of our cloud infrastructure, ensuring systems meet defined availability, performance, and scalability targets
- Collaborate with engineering squads to define service level objectives and translate reliability goals into technical roadmaps using agile practices
- Investigate system anomalies, conduct root cause analysis, and implement long-term fixes to strengthen resilience
- Optimize system performance by analyzing metrics from operating systems, applications, and distributed services
- Build automation frameworks to streamline deployment, configuration, and incident response workflows
- Design and maintain monitoring dashboards, alerting systems, and KPIs using Prometheus, Grafana, and supporting tools
- Participate in an on-call rotation with defined hours (4:00am – 4:00pm UTC), ensuring timely resolution of critical incidents
- Document postmortems in a blame-free culture and maintain up-to-date runbooks for operational consistency
- Support expansion into multi-region and multi-cloud environments with a focus on fault tolerance and operational simplicity
- Balance infrastructure efficiency with cost optimization while maintaining service quality
- Work with commercial teams to align platform capabilities with business growth initiatives
- Coordinate security testing activities, including test planning, environment setup, and provider coordination
- Lead improvements in team processes, communication practices, and engineering standards
- Guide software upgrades and platform enhancements to improve service reliability and functionality
What We’re Looking For
- Proven experience in site reliability or platform engineering roles
- Solid understanding of cloud infrastructure, SLAs, SLOs, and SLIs
- Strong communication and leadership skills with a collaborative mindset
- Experience improving operational workflows and performance metrics
- Background in software design, scripting, and automation using Go, Python, or similar
- Hands-on work with Kubernetes in production environments
- Experience managing infrastructure on AWS, including EKS, VPCs, load balancing, and security configurations
- Familiarity with operating MongoDB and Redis clusters at scale
- Proficiency with Linux system administration and troubleshooting
- Experience with Prometheus, Grafana, Thanos, and centralized logging systems
- Knowledge of networking fundamentals: TCP/IP, DNS, HTTP, TLS, UDP, routing, peering, and NAT
- Experience with infrastructure-as-code tools such as Terraform and Helm
- Willingness to participate in the on-call rotation during the specified hours (4:00am – 4:00pm UTC)
How We Work
This is a fully remote role open to candidates in Hong Kong and beyond, with flexible working hours outside the on-call rotation. We operate on trust, autonomy, and clear communication. Our culture values honesty, responsibility, and independent thinking. We encourage experimentation, learn openly from setbacks, and prioritize inclusive collaboration.
We offer unlimited paid holidays, generous parental leave, employee share options, wellbeing support, and paid volunteering days. Creativity thrives when people have control over their time, and we’re built around that principle.
Our Commitment
We are an equal opportunities employer. We do not discriminate based on age, gender, disability, race, religion, sexual orientation, or marital status. We believe diverse perspectives strengthen our team and are committed to fostering an inclusive environment where everyone can contribute meaningfully.


