Hyderabad or Bengaluru

DigitalOcean is hiring a Senior Cloud Support Engineer

DigitalOcean is looking for a Senior Cloud Support Engineer to serve as the definitive technical authority for resolving the most complex customer challenges, particularly around Kubernetes and GPU/GradientAI workloads. In this role, you will bridge deep support expertise with solutions architecture, designing sophisticated cloud infrastructure while maintaining a customer-first mentality and participating in an operational on-call rotation for critical incidents.

What You'll Do

  • Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure.
  • Architect enterprise-grade solutions for customers building large-scale AI/ML workloads, including multi-cluster Kubernetes deployments and distributed GPU training infrastructure.
  • Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews and performance optimization workshops.
  • Drive resolution of systemic technical challenges by identifying patterns and partnering with Engineering to implement platform-level improvements.
  • Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams.
  • Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations.
  • Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities and architectural best practices.
  • Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, and pair troubleshooting sessions.
  • Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices.
  • Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, and customer-facing solution guides.
  • Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms.

What We're Looking For

  • 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles, with a consistent record of technical leadership.
  • 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements.
  • Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking.
  • Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns.
  • Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale.
  • Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns.
  • Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization, and inference performance tuning.
  • Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring.
  • Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training.
  • Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence.
  • Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation.
  • Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues.
  • Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar).
  • Extensive experience with infrastructure-as-code (Terraform, CloudFormation, Pulumi) and configuration management tools.
  • Exceptional communication abilities: Can translate highly complex technical concepts into clear, actionable guidance for audiences ranging from junior engineers to C-level executives.
  • Demonstrated leadership capabilities including mentoring team members, leading cross-functional initiatives, and influencing without direct authority.
  • Strong consultative approach: Ability to discover underlying customer needs, challenge assumptions respectfully, and craft solutions that balance technical excellence with business pragmatism.
  • Track record of driving organizational improvement through process design, automation, documentation, and strategic initiatives.

Nice to Have

  • Bare Metal infrastructure expertise: Server provisioning, hardware troubleshooting, BIOS/firmware management, RAID configuration, and performance tuning.
  • Advanced networking knowledge: BGP, VLANs, network automation, traffic engineering, and datacenter networking concepts.
  • Kubernetes certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), or CKS (Certified Kubernetes Security Specialist).
  • Advanced cloud certifications: AWS Solutions Architect Professional, GCP Professional Cloud Architect, Azure Solutions Architect Expert.
  • GPU/AI certifications: NVIDIA DLI certifications, CUDA programming certifications, or similar specialized credentials.
  • Open-source contributions to AI/ML projects, Kubernetes ecosystem, or infrastructure tools.
  • Published technical content: Blog posts, whitepapers, solution guides, or technical documentation demonstrating thought leadership.
  • Speaking experience at technical conferences, meetups, or webinars on topics related to cloud infrastructure, AI/ML, or DevOps.
  • Active participation in technical communities (CNCF, Kubernetes SIGs, AI/ML forums, cloud-native communities).
  • Experience with observability platforms: Prometheus, Grafana, Datadog, New Relic, or similar monitoring/alerting systems.
  • Multi-cloud or hybrid-cloud architecture experience: Designing solutions that span AWS, GCP, Azure, and on-premises infrastructure.
  • Experience with DigitalOcean or Paperspace products as a user or customer.
  • Database expertise: Experience with both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases at scale.
  • Security & compliance knowledge: Experience with SOC 2, HIPAA, GDPR, or other compliance frameworks in cloud environments.

Technical Stack

  • Kubernetes, GPU/GradientAI
  • Python, PyTorch, TensorFlow, Hugging Face
  • CUDA, TensorRT, vLLM
  • Linux
  • Terraform, CloudFormation, Pulumi
  • Go, Rust, C++
  • Prometheus, Grafana, Datadog, New Relic
  • PostgreSQL, MySQL, MongoDB, Redis

Team & Environment

You will be part of the AI/ML Support team at DigitalOcean, a group dedicated to solving the most challenging technical problems for customers building advanced AI/ML workloads.

DigitalOcean is an equal opportunity employer.

Required Skills

Kubernetes, GPU, AI/ML Infrastructure, Python, PyTorch, TensorFlow, Hugging Face, CUDA, TensorRT, vLLM, Linux, DevOps, SRE, Distributed Computing, Performance Optimization
About company
DigitalOcean
DigitalOcean builds the simplest scalable cloud for a strong community of top talent and the dreamers and builders in the world.
Job Details
Department: Information Technology
Category: Infrastructure
Posted: 2 months ago