Staff Software Engineer - Managed Kubernetes

Lambda Corporation

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco, San Jose, or Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday.

About the Role

Lambda is building the AI Cloud of the future. We are seeking a Staff Engineer to help develop our Managed Kubernetes platform. Think GKE, but purpose-built for AI workloads and running on bare metal. This is a foundational technical leadership role where you will shape the infrastructure that powers the next generation of AI training and inference at scale.

As a Staff Engineer on our Orchestration team, you will collaborate to help drive the technical vision for Lambda's managed orchestration services, including Managed Kubernetes, Managed Slurm on Kubernetes, and higher-level platform services for inference and AIOps. You'll work at the intersection of distributed systems, GPU-accelerated computing, and Cloud Native infrastructure to build systems that are reliable, performant, and elegantly simple for our customers.

This is not a role for someone who just operates Kubernetes; it is a technical leadership role for an engineer who has synthesized the core domains of infrastructure (compute, network, storage, security) and can design holistic solutions across all of them. You'll be working closely with NVIDIA's open-source ecosystem and partnering with internal teams across the stack to deliver a world-class managed platform.

What You'll Do

Product Engineering

- Drive technical vision for Lambda's Managed Kubernetes bare-metal platform, including control plane scalability, multi-tenancy, cluster lifecycle management, and high availability
- Integrate and extend NVIDIA's open-source ecosystem: GPU Operator, Network Operator, DCGM, NCCL, and emerging projects like AICR and Topograph for topology-aware scheduling and placement
- Design GPU-aware orchestration systems
- Lead development of services that power our managed services
- Advise on and contribute to networking solutions for AI workloads: CNI integration (Cilium, Multus), high-performance fabrics (InfiniBand, RoCE), RDMA, and GPUDirect. You will work closely with our Network team to define and drive requirements
- Advise on and contribute to storage architecture requirements for AI workloads. You will partner with Storage teams on what managed K8s, Slurm, and future services need
- Build the foundation for Managed Slurm on Kubernetes, enabling traditional HPC workloads to run seamlessly alongside Kubernetes workloads
- Design higher-level platform services for inference, including model serving infrastructure, autoscaling based on inference load, and multi-model deployment patterns
- Design self-healing systems and automation for incident response, root cause analysis, and platform resilience
- Lead chaos engineering efforts to validate system behavior under failure conditions at scale
- Establish operational excellence for a managed service: upgrade automation, security patching, and zero-downtime maintenance

Cross-Functional Infrastructure Leadership

- Serve as the technical bridge between Orchestration and other infrastructure teams (Network, Storage, Security), translating platform requirements into actionable specifications
- Drive infrastructure-wide decisions that enable successful managed services. You're someone who understands what's needed end-to-end, not just at the Kubernetes layer
- Provide input on bare-metal provisioning, network topology, and storage systems to ensure they meet the needs of the managed services being built by the Orchestration organization
- Champion consistency and standardization across Lambda's infrastructure stack
- Work directly with customers and internal teams to understand existing deployments and chart a path to the managed platform

Technical Leadership

- Set technical direction for Kubernetes services across the Orchestration team, influencing roadmap and prioritization
- Drive reviews and design sessions, ensuring we build systems that are scalable, maintainable, and aligned with customer needs
- Mentor and grow engineers, establishing best practices for Kubernetes development, distributed systems, and Cloud Native engineering
- Collaborate cross-functionally with Network, Storage, Security, and Customer Success teams
- Engage with NVIDIA and the open-source community to stay current on GPU orchestration technologies and contribute back where appropriate
- Represent Lambda externally through technical blog posts, conference talks, and strategic customer engagements
- Shape our AIOps vision: design intelligent systems for automated capacity planning, anomaly detection, and predictive maintenance of cloud infrastructure

Who You Are

You are a creative, innovative engineer who operates at high velocity. You don't just solve problems. You find elegant solutions and ship them quickly. You embrace modern tools and AI-assisted development (like Claude Code) to accelerate your productivity and multiply your impact. You're energized by building new things, not maintaining the status quo.

Required Qualifications

- 10+ years of experience in software engineering, platform engineering, or SRE, with at least 5 years focused on Kubernetes at scale
- Expert-level understanding of Kubernetes internals: API machinery, controllers, schedulers, operators, CRDs, CSI, CNI, and the extension patterns that make Kubernetes powerful
- Holistic infrastructure expertise: you've synthesized knowledge across compute, networking, storage, and security, not just Kubernetes in isolation. You can build solutions that span the full stack
- Strong software engineering skills in Go (required) and Python; you write production-quality code, not just scripts
- Deep experience with GPU orchestration in Kubernetes: NVIDIA GPU Operator, device plugins, DCGM, MIG, time-slicing, and GPU-aware scheduling. Familiarity with NVIDIA Network Operator and GPUDirect is strongly preferred
- Proven track record of technical leadership: driving design decisions across teams, mentoring engineers, and influencing infrastructure direction beyond your immediate scope
- Deep experience designing and operating managed services or multi-tenant platforms. You understand what it takes to run infrastructure for external customers
- Strong understanding of distributed systems principles: consensus, fault tolerance, consistency models, and graceful degradation
- Experience with observability at scale: Prometheus, Grafana, distributed tracing, and building actionable alerting systems
- Solid knowledge of Linux systems and networking (L2-L7), including high-performance networking concepts (RDMA, InfiniBand, RoCE)
- Experience with infrastructure-as-code and GitOps workflows

Preferred Qualifications

- Experience building and operating managed Kubernetes services (GKE, EKS, AKS, or similar) or working on Kubernetes control plane components
- Hands-on experience with NVIDIA's open-source ecosystem beyond GPU Operator: Network Operator, NCCL tuning, Topograph, AICR, or similar emerging projects
- Familiarity with HPC and traditional job schedulers (Slurm) and Kubernetes-native batch scheduling (KAI, Volcano, Kueue)
- Background in confidential computing
- Experience migrating customers or workloads from legacy/bespoke infrastructure to standardized platforms
- Contributions to CNCF projects, Kubernetes SIGs, or NVIDIA open-source projects
- Familiarity with security and compliance in multi-tenant environments: RBAC, Pod Security Standards, network policies, workload isolation
- Background in ML infrastructure: training clusters, inference serving, simulation

Why Lambda

Lambda is building the essential infrastructure for the AI era. We're not just another cloud provider: we're a company founded by ML practitioners, for ML practitioners. Our customers include leading AI research labs and enterprises pushing the boundaries of what's possible with artificial intelligence.

What makes this role special

- You'll be building core platform services the world's largest AI companies will consume
- NVIDIA partnership: Deep integration with NVIDIA's GPU and networking stack, working with cutting-edge open-source tooling
- Real technical challenges: Massive scale GPU clusters and the unique demands of AI workloads
- Cross-stack influence: Shape not just Kubernetes, but the network, storage, and compute infrastructure that supports it
- Direct impact: Your work enables AI breakthroughs. Every model trained on Lambda benefits from systems you build
- World-class team: Work alongside engineers with deep expertise in ML, systems, and infrastructure

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda

- Founded in 2012, with 500+ employees, and growing fast
- Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Our values are publicly available:
- We offer generous cash & equity compensation
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k Plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

Vacancy posted 6 hours ago