Articul8 AI is hiring a Senior Site Reliability Engineer (SRE) in Brazil

About the Role

In this role, you will design, implement, and maintain reliable systems, combining software engineering and operational practices to support large-scale distributed services.

Responsibilities

  • Design and deploy scalable infrastructure solutions
  • Monitor system performance and respond to incidents
  • Implement automated recovery and self-healing mechanisms
  • Collaborate with development teams to improve service reliability
  • Define and track key reliability metrics
  • Troubleshoot complex production issues
  • Optimize system availability and latency
  • Develop tools for operational efficiency
  • Maintain documentation for systems and processes
  • Support deployment pipelines and CI/CD workflows
  • Enforce security and compliance standards
  • Participate in on-call rotations
  • Conduct post-incident reviews
  • Improve observability through logging and alerting
  • Reduce technical debt in production systems
  • Evaluate new technologies for operational impact
  • Drive incident response coordination
  • Ensure capacity planning meets demand
  • Promote best practices in reliability engineering
  • Integrate feedback loops for continuous improvement

Nice to Have

  • Master's degree in a technical field
  • Experience with large-scale microservices architectures
  • Contributions to open-source projects
  • Certifications in cloud or DevOps platforms
  • Background in machine learning infrastructure
  • Experience with service-level objectives and error budgets
  • Knowledge of chaos engineering principles
  • Prior work in AI-driven technology environments
  • Leadership in cross-functional initiatives
  • Published technical content or conference talks

Compensation

Competitive salary and benefits package

Work Arrangement

Remote, based in Brazil

Team

Collaborative engineering team focused on scalable systems

Why This Role Matters

  • This position plays a critical role in ensuring the stability and performance of core services.
  • You will directly influence system design and operational resilience.

What We Expect

  • Proactive problem solving and ownership of system health.
  • A mindset focused on automation, measurement, and continuous improvement.

Required Skills

  • AWS
  • GCP
  • Microsoft Azure
  • Python
  • Go
  • Bash
  • Terraform
  • CloudFormation
  • Docker
  • Kubernetes
  • Infrastructure as Code
  • Cloud Infrastructure
  • Monitoring

About the Company

Articul8 AI creates exceptional AI products that exceed customer expectations.

Job Details

Category: Infrastructure
Posted: a year ago