
Senior Systems Software Engineer, Kubernetes Node Lifecycle - DGX Cloud

$184k - $287.5k
Full-time

NVIDIA

At NVIDIA, the DGX Cloud division merges fresh hardware and software innovations to offer leading accelerated computing solutions for the most challenging AI workloads worldwide. Our team of skilled engineers is committed to addressing major global issues, consistently advancing technology, and making a difference in millions of lives around the world!

We are looking for a Senior Systems Software Engineer with strong experience in Kubernetes node engineering, OS image packaging, and cloud infrastructure. The ideal candidate will possess deep hyperscaler-level knowledge across the entire node lifecycle. This covers CAPI providers, bring-your-own-node onboarding, OS image build pipelines, packaging, and nodepool management. They must have the technical depth needed to maintain cluster reliability at frontier AI scale. In this vital role, you will manage the node layer within NVIDIA Kubernetes Engine (NKE). Your work will ensure it scales to fulfill DGX Cloud's two main goals: supporting internal researchers and enabling NCPs. Are you prepared to innovate?

What you'll be doing:
  • Direct the building and refinement of CAPI providers for NVIDIA Kubernetes Engine, maintaining steady, consistent, and scalable node provisioning across DGX Cloud and NCP environments.
  • Develop and maintain bring-your-own-node workflows that allow customers to integrate different NVIDIA hardware into NKE clusters while ensuring high operational consistency.
  • Coordinate OS image generation, packaging, deployment, and update processes for NKE nodes. Ensure images are fine-tuned for NVIDIA GPU workloads and satisfy enterprise- and cloud-grade security and compliance criteria.
  • Develop and sustain node image hardening pipelines, incorporating CIS benchmarks, automated CVE remediation, and promotion gates connected to security posture.
  • Develop and maintain automated test suites for node images. These tests verify accuracy across Kubernetes versions and NVIDIA hardware configurations. This process occurs prior to production deployment and facilitates continuous validation through modern CI/CD pipelines.
  • Handle nodepool lifecycle at scale, including provisioning, upgrades, drain and cordon workflows, and seamless node replacement across very large clusters with diverse NVIDIA hardware.
  • Examine, resolve, and determine underlying causes of node-layer faults in production NKE clusters, such as those involving image configuration, driver packaging, kubelet operation, and hardware activation, and review and optimize the node layer in real-world high-scale scenarios.
  • Partner with upstream communities including Cluster API, Kubernetes, and CNCF projects to establish node provisioning and lifecycle standards in accordance with NKE requirements.
  • Communicate your progress and findings at internal and external gatherings such as KubeCon and GTC.

What we need to see:
  • 8 years of experience with a background in systems software, cloud infrastructure, or Kubernetes node engineering.
  • Bachelor’s or Master’s degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience.
  • Deep expertise in Cluster API (CAPI), including provider development and full machine lifecycle from provisioning to deletion.
  • Extensive experience with OS image build pipelines, node image packaging, and delivery systems for Kubernetes nodes (for example image-builder, containerd, cloud-init, packer).
  • Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments, including large-scale nodepool lifecycle management and upgrades.
  • Strong understanding of kubelet configuration, node bootstrap, and the Kubernetes node registration lifecycle.
  • Experience with node image security, including vulnerability scanning, patch automation, and compliance gating as part of image build pipelines.
  • Proficiency in Golang and/or Python, and hands-on experience with at least one major public cloud provider (GCP, AWS, Azure, OCI or equivalent).

Ways to stand out from the crowd:
  • Direct experience building or maintaining node image pipelines for a hyperscaler Kubernetes distribution (GKE, EKS, AKS, OKE, or equivalent).
  • Experience with supply chain security and hardening for node images, including image signing, provenance attestation, SBOM generation, CIS benchmark consistency, and automated CVE remediation.
  • Experience with automated node provisioning and optimal sizing at scale (for example Karpenter, GKE NAP or similar) and how these interact with GPU workload scheduling.
  • Strong operational experience working with immutable OS image distributions (such as Flatcar, Bottlerocket, Azure Linux) and debugging node-layer failures in large Kubernetes clusters.
  • Proven background of upstream contributions to Cluster API, Kubernetes or related CNCF projects, combined with excellent communication and interpersonal abilities.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 14, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.

Vacancy posted 9 hours ago
Similar jobs that could be interesting for you
Based on the Senior Systems Software Engineer, Kubernetes Node Lifecycle - DGX Cloud in Santa Clara, CA vacancy
  • $184k - $287.5k

    The DGX Cloud organization at NVIDIA brings...  ...edge hardware and software innovation to deliver...  ...of innovative engineers dedicated to...  ...for an outstanding Senior Systems Software Engineer...  ...technologies such as Kubernetes and containers,...  ...clusters to ultra-large node and object counts... 
    Cloud
    Senior
    Full time
    Worldwide

    NVIDIA

    Santa Clara, CA
    9 hours ago
  • $184k - $287.5k

    Senior Software Engineer - GPU Cloud Infrastructure We are looking for a Senior Software Engineer who sees...  ...role in upstream communities such as Kubernetes (k8s) and KubeVirt, adding features...  ..., operations). Own and document system and software architecture, designs,... 
    Cloud
    Senior
    Worldwide

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • We are looking for a Senior Software Engineer to become part of our storage management plane team. The...  ...Will Be Doing Maintain and develop Kubernetes operators and our Container Storage Interface...  ..., Bash or similar. Experience with Node.js is a must. At least 5 years of... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $184k - $287.5k

    Overview NVIDIA DGX Cloud is building and operating...  ...We are looking for Senior Software Engineers to help build the automation...  ..., and operational systems that make GPU...  ...engineering team focused on Kubernetes-based infrastructure...  ...repair, and cluster lifecycle operations. Improve... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $208k - $327.75k

     ...world-class Senior Product...  ...the NVIDIA DGX is the undisputed...  ...a 1,000-node private cluster...  ...the public cloud? The...  ...role, own the software-defined blueprint...  ...On-Prem Lifecycle: Define the...  ...to Kubernetes: Lead the integration...  ...of DGX systems into the...  ...of multiple engineering fields. As... 
    Cloud
    Senior
    Night shift

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $152k - $241.5k

    Senior AI Infrastructure Engineer - DGX Cloud...  ...join our DGX Cloud group. This...  ...production systems with high efficiency...  ...of software and systems...  ...technologies like Kubernetes and...  ...GPU and multi-node clusters. Engage...  ...improve the whole lifecycle of services—... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $320k

     ...company is seeking a seasoned individual to spearhead DGX Cloud strategy, focusing on GPU lifecycle and operational health. The ideal candidate will...  ...collaborating with stakeholders, and managing full software and system lifecycles. If you're passionate about technology and... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    17 hours ago
  • $184k - $356.5k

    NVIDIA Corporation is seeking a Senior Software Engineer for DGX Cloud Production Engineering in Santa Clara, CA. You...  ...automation, tooling, and operational systems. The ideal candidate will have extensive experience with Kubernetes, programming skills, and a strong... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  •  ..., California is seeking a skilled software engineer to design and maintain its cloud-native platform services. The ideal...  ...Java, or Go and deep expertise in Kubernetes architecture. This role involves owning the full development lifecycle and fostering team excellence... 
    Cloud
    Senior

    Illumio

    Sunnyvale, CA
    2 days ago
  • $136.5k - $276.5k

     ...Senior Systems Software Engineer This role has been designed as "Onsite" with an expectation that you...  ...Packard Enterprise is the global edge-to-cloud company advancing the way people...  ...Cisco is a strong plus. Knowledge of Kubernetes and associated technologies... 
    Cloud
    Senior
    Work experience placement
    Work at office
    2 days per week

    Hewlett Packard Enterprise

    Sunnyvale, CA
    2 days ago
  • $152k - $241.5k

     ...motivated Performance Engineer to influence the...  ...‑GPU and multi‑node clusters. Study the...  ..., networking) and software components in the...  ...understanding of computer system architecture,...  ...with containers, cloud provisioning, and scheduling tools (Kubernetes, SLURM, Ansible,... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  •  ...a motivated Performance engineer to influence the roadmap...  ...large multi-GPU and multi-node clusters. Study the...  ...understanding of computer system architecture, HW‑SW interactions...  ...with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $184k - $287.5k

     ...best work. The video software team is seeking someone...  ...and passionate about system software development....  ...including ultra‑low latency cloud gaming, video...  ...features through the entire lifecycle from requirements and...  ...degree in Electrical Engineering or Computer Science (or... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $200k - $322k

    Senior Technical Program Manager - DGX Cloud Infra Security...  ...roadmaps and the software development lifecycle. It aligns product...  ...Compliance, SRE, and Engineering to continually advance...  ...documentation, including system diagrams, process... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • Overview NVIDIA is seeking a Senior Software Engineer to join our CSP Engagements team, focusing on system software for datacenter products...  ...responsibilities to enable cloud service providers with next‑...  ...experience with virtualization, Kubernetes, and cloud‑native... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $152k - $241.5k

     ...motivated Partner Enablement Engineer to guide our key partners...  ...to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS,...  ...applications on multi-node clusters Document and conduct...  ...(Docker, Docker Swarm, Kubernetes, SLURM, Ansible) Adaptability... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • GeForce NOW is Nvidia’s Cloud Gaming service, streaming games at the highest quality...  ...devices. We are looking for a Senior System Software Engineer for Cloud who sees the big picture of...  ...members and drive best practices in Kubernetes, observability, and infrastructure... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  •  ...A leading cloud service provider in California is seeking a Senior Platform Engineer II to champion reliability and administer multi-tenant Kubernetes platforms. The ideal candidate will have over 5 years of experience in Kubernetes, a strong background in resilient application... 
    Cloud
    Senior

    CoreWeave

    Sunnyvale, CA
    3 days ago
  • GeForce NOW is NVIDIA’s Cloud Gaming service that streams...  ...using GPUs and proprietary software to deliver high‑quality streaming...  ...We are looking for a Senior System Software Engineer for Cloud who sees the big...  ..., APIs, frameworks) on Kubernetes. Architect and manage Kubernetes... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $165k - $242k

     ...A cloud service provider is seeking a Senior Software Engineer II for their Inference team in Sunnyvale, California. In this role, you'll lead...  ...extensive experience with distributed systems, strong Python/Go skills, and Kubernetes expertise. This position offers a competitive... 
    Cloud
    Senior

    CoreWeave

    Sunnyvale, CA
    3 days ago
  • A leading technology company is looking for a Java SRE Engineer to support large-scale cloud migrations and production systems on AWS and Kubernetes. You will lead migrations, design robust AWS EKS platforms, and implement deployment strategies. The ideal candidate has... 
    Cloud
    Senior

    EITACIES Inc.

    Santa Clara, CA
    4 days ago
  • $139k - $204k

     ...located in Sunnyvale, California, is looking for a Senior Software Engineer to advance its orchestration platform, including SUNK (Slurm on Kubernetes). Ideal candidates will have 3–5 years of experience in distributed systems, expertise in Go, and hands-on Kubernetes... 
    Cloud
    Senior

    Energy Jobline ZR

    Sunnyvale, CA
    3 days ago
  • $184k - $287.5k

     ...redefining AI, HPC, and cloud computing. To...  ...globally, our diagnostic systems need to evolve across...  ...visionary technical leader to engineer and propel innovation...  ...involve hardware and software tools to develop the...  ...effectively across the server lifecycle. Drive hardware... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • NVIDIA Gruppe is seeking a Senior Software Engineer to join our Cloud Service Provider Engagements team in Santa Clara, California. This role involves...  ...environment. The ideal candidate must have expertise in Kubernetes internals and significant experience with next-gen... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to design and build next-generation cloud platforms. This role involves developing scalable solutions...  ...using advanced technologies such as GPUs and Kubernetes. Ideal candidates will have over 7 years of... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $128k - $176k

     ...leading technology firm in Santa Clara seeks a DevOps Engineer with a strong background in CI/CD systems and cloud environments. The ideal candidate will have over...  ...in DevOps, proficient in tools like Jenkins and Kubernetes, and will play a key role in developing... 
    Cloud
    Senior
    Full time

    Victrays

    Santa Clara, CA
    4 days ago
  •  ...Crusoe is looking for a Senior Software Engineer to join our cloud software team in Sunnyvale, California. The role involves contributing...  ...cutting-edge infrastructure and requires expertise in Kubernetes, GoLang, and systems engineering. You will work collaboratively to... 
    Cloud
    Senior

    Crusoe

    Sunnyvale, CA
    3 days ago
  • $356.5k

    NVIDIA Gruppe is seeking an experienced AI infrastructure software engineer to join its DGX Cloud AI Efficiency Team in Santa Clara, California. This...  ...workloads and ensuring high availability and efficiency of AI systems. The ideal candidate will have over 8 years of... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $200k - $322k

    NVIDIA Corporation is seeking a Senior Technical Program Manager for DGX Cloud Infra Security in Santa Clara, CA. You will lead security initiatives to embed compliance controls and governance frameworks into the cloud infrastructure. The ideal candidate has over 12 years... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $224k - $356.5k

     ...world. As part of the DGX Cloud organization, the Attestation...  ...of NVIDIA systems at scale. You’ll own highly...  ...security, silicon, and cloud engineering teams to turn embedded...  ..., platform, and software teams to deliver end-to...  ...-native platforms: Kubernetes, Docker/containers, and... 
    Cloud
    Senior
    Remote work

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
