Responsibilities
- Define the technical roadmap for a bare-metal Kubernetes platform, focusing on control plane scalability, high availability, multi-tenancy, and cluster lifecycle management
- Extend and integrate open-source tools from NVIDIA’s ecosystem, including GPU Operator, Network Operator, DCGM, NCCL, AICR, and Topograph for intelligent GPU scheduling
- Develop orchestration systems designed specifically for GPU-accelerated applications
- Lead engineering efforts for core services that power managed platform offerings
- Collaborate with networking teams to shape solutions for AI workloads, including CNI plugins like Cilium and Multus, high-speed fabrics such as InfiniBand and RoCE, RDMA, and GPUDirect
- Contribute to storage architecture planning for AI use cases, working closely with storage teams to align with Kubernetes, Slurm, and future platform needs
- Develop foundational components for running Managed Slurm on Kubernetes to support traditional HPC applications
- Design platform-level services for model inference, including scalable serving infrastructure, load-based autoscaling, and multi-model deployment strategies
- Build self-healing mechanisms and automated responses for incident detection, root cause analysis, and system resilience
- Lead large-scale chaos engineering initiatives to test system reliability under failure conditions
- Establish best practices for managed service operations, including automated upgrades, security patching, and zero-downtime maintenance
- Act as a technical liaison between orchestration and infrastructure teams, translating platform needs into implementable specifications
- Influence cross-infrastructure decisions that support robust managed services, with end-to-end system understanding beyond just Kubernetes
- Provide input on bare-metal provisioning, network layout, and storage configurations to meet orchestration service requirements
- Promote consistency and standardization across the full infrastructure technology stack
- Work directly with customers and internal stakeholders to understand deployment patterns and guide migration to managed platforms
- Set technical direction for Kubernetes-based services, shaping team roadmaps and priorities
- Lead design reviews and architectural discussions to ensure systems are scalable, maintainable, and customer-aligned
- Mentor engineers and establish best practices in Kubernetes development, distributed systems, and cloud-native engineering
- Partner with Network, Storage, Security, and Customer Success teams to deliver integrated solutions
- Engage with NVIDIA and open-source communities to track advancements in GPU orchestration and contribute improvements
- Represent the company through technical publications, conference presentations, and strategic customer interactions
- Help define the AIOps strategy by designing systems for predictive capacity planning, anomaly detection, and proactive infrastructure maintenance
Benefits
- Competitive cash and equity compensation package
- Comprehensive health, dental, and vision insurance for employees and dependents
- Wellness and commuter allowances for eligible roles
- 401(k) plan with a 2% employer contribution for U.S.-based employees
- Flexible paid time off policy that is actively used by the team
Compensation
Competitive cash and equity compensation package
Work Arrangement
On-site — San Francisco, San Jose, Bellevue
- This position requires presence in our San Francisco, San Jose, or Bellevue office location 4 days per week
- Lambda’s designated work-from-home day is currently Tuesday
Other
- You do not need to match all of the listed expectations to apply for this position
- Lambda is an Equal Opportunity employer