Responsibilities

Define the technical roadmap for a bare-metal Kubernetes platform, focusing on control plane scalability, high availability, multi-tenancy, and cluster lifecycle management
Extend and integrate open-source tools from NVIDIA’s ecosystem, including GPU Operator, Network Operator, DCGM, NCCL, AICR, and Topograph for intelligent GPU scheduling
Develop orchestration systems designed specifically for GPU-accelerated applications
Lead engineering efforts for core services that power managed platform offerings
Collaborate with networking teams to shape solutions for AI workloads, including CNI plugins like Cilium and Multus, high-speed fabrics such as InfiniBand and RoCE, RDMA, and GPUDirect
Contribute to storage architecture planning for AI use cases, working closely with storage teams to align with Kubernetes, Slurm, and future platform needs
Develop foundational components for running Managed Slurm on Kubernetes to support traditional HPC applications
Design platform-level services for model inference, including scalable serving infrastructure, load-based autoscaling, and multi-model deployment strategies
Build self-healing mechanisms and automated responses for incident detection, root cause analysis, and system resilience
Lead large-scale chaos engineering initiatives to test system reliability under failure conditions
Establish best practices for managed service operations, including automated upgrades, security patching, and zero-downtime maintenance
Act as a technical liaison between orchestration and infrastructure teams, translating platform needs into implementable specifications
Influence cross-infrastructure decisions that support robust managed services, drawing on end-to-end system understanding that extends beyond Kubernetes
Provide input on bare-metal provisioning, network layout, and storage configurations to meet orchestration service requirements
Promote consistency and standardization across the full infrastructure technology stack
Work directly with customers and internal stakeholders to understand deployment patterns and guide migration to managed platforms
Set technical direction for Kubernetes-based services, shaping team roadmaps and priorities
Lead design reviews and architectural discussions to ensure systems are scalable, maintainable, and customer-aligned
Mentor engineers and establish best practices in Kubernetes development, distributed systems, and cloud-native engineering
Partner with Network, Storage, Security, and Customer Success teams to deliver integrated solutions
Engage with NVIDIA and open-source communities to track advancements in GPU orchestration and contribute improvements
Represent the company through technical publications, conference presentations, and strategic customer interactions
Help define the AIOps strategy by designing systems for predictive capacity planning, anomaly detection, and proactive infrastructure maintenance

Benefits

Competitive cash and equity compensation package
Comprehensive health, dental, and vision insurance for employees and dependents
Wellness and commuter allowances for eligible roles
401(k) plan with a 2% employer contribution for U.S.-based employees
Flexible paid-time-off policy that the team actively uses

Work Arrangement

On-site — San Francisco, San Jose, Bellevue

This position requires presence in our San Francisco, San Jose, or Bellevue office 4 days per week
Lambda’s designated work-from-home day is currently Tuesday

Other

You do not need to match all of the listed expectations to apply for this position
Lambda is an Equal Opportunity employer

Lambda is hiring a Staff Software Engineer - Managed Kubernetes
