Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud

NVIDIA
Santa Clara, CA

The DGX Cloud organization at NVIDIA brings together cutting‑edge hardware and software innovation to deliver industry‑leading accelerated computing for the world’s most adventurous AI workloads. We are a team of innovative engineers dedicated to solving some of the world’s biggest challenges, constantly driving advancements, and impacting millions of lives worldwide.

We are looking for an outstanding Senior Systems Software Engineer with deep experience in distributed systems, open‑source technologies such as Kubernetes and containers, and a strong background in systems performance and scalability. The ideal candidate brings broad, end‑to‑end experience across the stack – from GPU operator and device plugins to distributed inference serving and cloud platforms – along with the technical depth to investigate and address exciting, real‑world problems at scale.

What You’ll Be Doing:
  • Drive end‑to‑end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving, following issues from orchestration down to the metal.
  • Collaborate with AI researchers, developers and customers to develop innovative, automated tests that simulate real user workloads using custom‑built and leading open‑source tools and frameworks.
  • Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and the NVIDIA software stack, to identify and resolve root causes.
  • Design and develop monitoring, reporting and analysis tools for performance and scale testing across software, GPU and CPU resources.
  • Triage, debug and root cause issues related to operating Kubernetes clusters at ultra‑large scale, ensuring reliability and efficiency.
  • Build and maintain a high‑velocity framework that enables continuous, always‑on performance and scale testing via a modern CI/CD pipeline.
  • Document research, methodologies and results clearly and concisely, and present findings at internal and external venues, including community conferences such as KubeCon and GTC.
  • Engage efficiently with upstream communities – including Kubernetes, CNCF and NVIDIA open‑source projects – to validate performance and scalability of AI workloads early and help shape design and development decisions.

What We Need to See:
  • 8+ years of experience in Computer Architecture, Networking, Storage systems, and Accelerators, and a Bachelor’s/Master’s in Engineering (preferably Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience.
  • Expertise in Kubernetes and familiarity with related CNCF projects.
  • Background in working with large‑scale parallel and distributed accelerator‑based systems.
  • Expertise optimizing performance and AI workloads on large‑scale systems.
  • Experience with performance modeling and benchmarking at scale.
  • Proficiency in Golang/Python.
  • Background with the NVIDIA software ecosystem in both training and inference domains.
  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, etc.).

Ways to Stand Out From the Crowd:
  • Strong operational experience with any one of the Kubernetes distributions.
  • Prior experience scaling Kubernetes clusters to ultra‑large node and object counts.
  • Demonstrated history of working in the open‑source community.
  • Excellent communication and interpersonal abilities.
  • PhD in relevant areas.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD – 287,500 USD for Level 4, and 224,000 USD – 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering an inclusive work environment and is proud to be an equal‑opportunity employer. We do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 5 days ago