Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud

NVIDIA Corporation

The DGX Cloud organization at NVIDIA brings together cutting‑edge hardware and software innovation to deliver industry‑leading accelerated computing for the world’s most adventurous AI workloads. We are a team of innovative engineers dedicated to solving some of the world’s biggest challenges, constantly driving advancements, and impacting millions of lives worldwide.

We are looking for an outstanding Senior Systems Software Engineer with deep experience in distributed systems, open‑source technologies such as Kubernetes and containers, and a strong background in systems performance and scalability. The ideal candidate brings broad, end‑to‑end experience across the stack, from GPU Operator and device plugins to distributed inference serving and cloud platforms, along with the technical depth to investigate and address exciting, real‑world problems at scale.

What You’ll Be Doing:
  • Drive end‑to‑end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving, following issues from orchestration down to the metal.
  • Collaborate with AI researchers, developers, and customers to develop innovative, automated tests that simulate real user workloads using custom‑built and leading open‑source tools and frameworks.
  • Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and the NVIDIA software stack, to identify and resolve root causes.
  • Design and develop monitoring, reporting, and analysis tools for performance and scale testing across software, GPU, and CPU resources.
  • Triage, debug, and root‑cause issues related to operating Kubernetes clusters at ultra‑large scale, ensuring reliability and efficiency.
  • Build and maintain a high‑velocity framework that enables continuous, always‑on performance and scale testing via a modern CI/CD pipeline.
  • Document research, methodologies, and results clearly and concisely, and present findings at internal and external venues, including community conferences such as KubeCon and GTC.
  • Engage efficiently with upstream communities, including Kubernetes, CNCF, and NVIDIA open‑source projects, to validate performance and scalability of AI workloads early and help shape design and development decisions.

What We Need to See:
  • 8+ years of experience in computer architecture, networking, storage systems, and accelerators, and a Bachelor’s or Master’s degree in Engineering (preferably Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience.
  • Expertise in Kubernetes and familiarity with related CNCF projects.
  • Background in working with large‑scale parallel and distributed accelerator‑based systems.
  • Expertise in optimizing performance and AI workloads on large‑scale systems.
  • Experience with performance modeling and benchmarking at scale.
  • Proficiency in Golang/Python.
  • Background with the NVIDIA software ecosystem in both training and inference domains.
  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, etc.).

Ways to Stand Out From the Crowd:
  • Strong operational experience with any Kubernetes distribution.
  • Prior experience scaling Kubernetes clusters to ultra‑large node and object counts.
  • Demonstrated history of working in the open‑source community.
  • Excellent communication and interpersonal abilities.
  • PhD in relevant areas.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering an inclusive work environment and is proud to be an equal‑opportunity employer. We do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud vacancy in Santa Clara, CA
  • $184k - $287.5k

     ...on the world. The DGX Cloud organization at...  ...edge hardware and software innovation to deliver...  ...forward‑thinking engineers tackling some of...  ...searching for a Senior Systems Software Engineer...  ...systems, Kubernetes, containers, and...  ...problems at large scale and help shape how... 
    Cloud
    Senior
    Full time
    Remote work

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $184k - $356.5k

    NVIDIA Corporation is seeking a Senior Systems Software Engineer based in Santa Clara, California....  ...characterization for the NVIDIA DGX Cloud software stack and collaborating...  ...and a related degree. Knowledge of Kubernetes and large-scale systems is essential. Competitive... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • $184k - $287.5k

    Senior Software Engineer - GPU Cloud Infrastructure We are looking for a Senior Software...  ...upstream communities such as Kubernetes (k8s) and KubeVirt, adding...  ...methodology for high‑scale, high‑availability services...  ...operations). Own and document system and software architecture,... 
    Cloud
    Senior
    Worldwide

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • Senior Systems Software Engineer - GPU Performance at Scale We are looking for a dedicated engineer for the Senior Systems Software Engineer role, focusing on GPU...  ...computing software stacks (CUDA). Experience with modern cloud and container‑based enterprise computing... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $272k - $431.25k

     ...Corporation is looking for a Principal Software Engineer for DGX Cloud Production Engineering to...  ...and lead efforts in large-scale GPU operations. The successful...  ...years of experience in distributed systems, with strong skills in Kubernetes and automation. Located in Santa... 
    Cloud
    Remote job

    NVIDIA

    Santa Clara, CA
    3 days ago
  • $152k - $241.5k

    NVIDIA Corporation is seeking a Senior AI Infrastructure Engineer in Santa Clara, California. This role involves designing, building, and maintaining large-scale production systems for AI services. Applicants should have a BS degree and at least 5 years of relevant experience... 
    Cloud
    Senior

    NVIDIA

    Santa Clara, CA
    3 days ago
  • $356.5k

    NVIDIA Gruppe is seeking an experienced AI infrastructure software engineer to join its DGX Cloud AI Efficiency Team in Santa Clara, California. This...  ...workloads and ensuring high availability and efficiency of AI systems. The ideal candidate will have over 8 years of... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $152k - $241.5k

     ...motivated Performance Engineer to influence the...  ...demand and run at scales that reach tens of...  ..., networking) and software components in the...  ...of computer system architecture, hardware...  ...with containers, cloud provisioning, and scheduling tools (Kubernetes, SLURM, Ansible, Docker... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  •  ...a motivated Performance engineer to influence the roadmap...  ...benchmarking and triage on large‑scale HPC clusters. Good understanding of computer system architecture, HW‑SW...  ...Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker).... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • GeForce NOW is Nvidia’s Cloud Gaming service, streaming games...  .... We are looking for a Senior System Software Engineer for Cloud who sees the big...  ...Responsibilities Design, build, and scale distributed cloud‑based...  ...drive best practices in Kubernetes, observability, and... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    18 hours ago
  •  ...NVIDIA is seeking a Senior Software Engineer to join our CSP Engagements...  ...team, focusing on system software for...  ...responsibilities to enable cloud service providers with...  ...optimization for large‑scale data center...  ...with virtualization, Kubernetes, and cloud‑native architectures... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  •  ...passionate, and dedicated Senior AI Infrastructure Engineer to join our DGX Cloud group. This...  ...build and maintain large‑scale production systems with high efficiency...  ...using a combination of software and systems engineering...  ...technologies like Kubernetes and OpenStack. The DGX... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • GeForce NOW is NVIDIA’s Cloud Gaming service that...  ...and proprietary software to deliver high‑quality...  ...are looking for a Senior System Software Engineer for Cloud who sees the...  ..., frameworks) on Kubernetes. Architect and manage...  ...workload scheduling, auto‑scaling, multi‑cluster,... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • A leading technology company is looking for a Java SRE Engineer to support large-scale cloud migrations and production systems on AWS and Kubernetes. You will lead migrations, design robust AWS EKS platforms, and implement deployment strategies. The ideal candidate has... 
    Cloud
    Senior

    EITACIES Inc.

    Santa Clara, CA
    18 hours ago
  • $168k - $322k

    NVIDIA Gruppe is seeking a Senior AI Platform Engineer to improve engineering efficiency...  ...role involves working with Cloud and AI/ML teams to build and scale infrastructure and shape the...  ...expertise in distributed systems along with Kubernetes. Competitive compensation between... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $224k - $356.5k

     ...world. As part of the DGX Cloud organization, the...  ...authenticity of NVIDIA systems at scale. You’ll own highly available...  ..., silicon, and cloud engineering teams to turn...  ...silicon, platform, and software teams to deliver end-...  ...cloud-native platforms: Kubernetes, Docker/containers,... 
    Cloud
    Senior
    Remote work

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • NVIDIA Gruppe is seeking a Senior Software Engineer specializing in Resilience Engineering for DGX Cloud. This role emphasizes building and maintaining high reliability...  ...in operational practices for large-scale systems and strong software engineering skills in Go... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • Nvidia Corporation in Santa Clara, California, is seeking a Systems Engineer to design and deploy Kubernetes solutions for large-scale data platforms. This role requires extensive experience with Kubernetes and a strong analytical mindset to enhance the reliability and... 
    Cloud
    Senior

    NVIDIA

    Santa Clara, CA
    4 days ago
  • Senior Systems Software Engineer (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high efficiency and availability...  ...and open source cloud enabling technologies like Kubernetes and OpenStack. Senior Systems... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  •  ...Senior Systems Software Engineer – Advanced Infrastructure Software Team We are seeking a Senior Systems Software...  ...maintaining high-performance, rack-scale management solutions for datacenter...  ...that bridge hardware, firmware, and cloud-native services. What you’ll be doing... 
    Cloud
    Senior

    NVIDIA

    Santa Clara, CA
    4 days ago
  •  ...passionate about building world-class reliability systems? Join NVIDIA as a Senior Software Engineer - Resilience Engineering, DGX Cloud, and be a pivotal part of a team that...  ...seasoned engineer with experience in running large-scale systems and a deep understanding of... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $168k - $258.75k

    Senior Technical Program Manager, DGX Cloud Software Products and Services...  ...stability, and operational scale. The TPM also guides...  ...will work closely with engineering, SRE, operations, and researchers...  ...teams to identify systemic risks, resolve cross-... 
    Cloud
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $200k - $322k

    NVIDIA’s DGX Cloud is redefining how organizations deploy and scale AI infrastructure. We’re looking for a Senior Technical Program Manager...  ...interfacing with engineering, product, operations...  ...of large‑scale software or infrastructure...  ...Storage Systems: SAN, NAS, object... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • Joining NVIDIA's DGX Cloud AI Efficiency Team means contributing...  ...resources and scale to foster innovation....  ...an AI infrastructure software engineer to join our team. You'...  ...implementing software and systems engineering practices...  ...of AI systems. As a senior DGX Cloud AI... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $208k - $333.5k

    Systems Engineering is an engineering discipline focused on building...  ...that deliver large-scale production systems...  .... It combines software and systems engineering...  ...across domains such as Kubernetes and container orchestration...  ...external facing GPU cloud services are deployed... 
    Cloud
    Senior
    Flexible hours

    Nvidia Corporation

    Santa Clara, CA
    4 days ago
  • NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to design and build next-generation cloud platforms. This role involves developing scalable solutions...  ...using advanced technologies such as GPUs and Kubernetes. Ideal candidates will have over 7 years of... 
    Cloud
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    18 hours ago
  • $224k - $356.5k

     ...on the extensive scale-up of key AI solutions...  ...NVIDIA's internal cloud infrastructure....  ...thousands of NVIDIA software developers...  ...various operating systems (Windows/Linux/Android...  ...also involve guiding engineers in solving complex...  ..., Docker, Kubernetes, Chef/Puppet, Hadoop... 
    Cloud
    Senior
    Work experience placement
    Worldwide

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $176k - $276k

    Production engineering is a field that involves crafting...  ...and maintaining large-scale production systems with high efficiency and...  ...areas, including software and systems engineering...  ...along with open-source cloud-enabling technologies such as Kubernetes, containers, and virtualization... 
    Cloud
    Senior
    Flexible hours

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $128k - $176k

     ...leading technology firm in Santa Clara seeks a DevOps Engineer with a strong background in CI/CD systems and cloud environments. The ideal candidate will have over...  ...in DevOps, proficient in tools like Jenkins and Kubernetes, and will play a key role in developing... 
    Cloud
    Senior
    Full time

    Victrays

    Santa Clara, CA
    18 hours ago
  • $170k - $200k

     ...seeking a talented Site Reliability Engineer to join their engineering team...  ...build, maintain, and monitor cloud services, ensuring high availability and security of systems. The ideal candidate will have...  ...expertise in CI/CD tools, Kubernetes, and automation. Offering a... 
    Cloud
    Senior

    Zoomcar

    Sunnyvale, CA
    18 hours ago
