Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud
NVIDIA Corporation
The DGX Cloud organization at NVIDIA brings together cutting-edge hardware and software innovation to deliver industry-leading accelerated computing for the world's most adventurous AI workloads. We are a team of innovative engineers dedicated to solving some of the world's biggest challenges, constantly driving advancements, and impacting millions of lives worldwide.

We are looking for an outstanding Senior Systems Software Engineer with deep experience in distributed systems, open-source technologies such as Kubernetes and containers, and a strong background in systems performance and scalability. The ideal candidate brings broad, end-to-end experience across the stack (from GPU Operator and device plugins to distributed inference serving and cloud platforms), along with the technical depth to investigate and address exciting, real-world problems at scale.

What You'll Be Doing:
- Drive end-to-end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving, following issues from orchestration down to the metal.
- Collaborate with AI researchers, developers, and customers to develop innovative, automated tests that simulate real user workloads using custom-built and leading open-source tools and frameworks.
- Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and the NVIDIA software stack, to identify and resolve root causes.
- Design and develop monitoring, reporting, and analysis tools for performance and scale testing across software, GPU, and CPU resources.
- Triage, debug, and root-cause issues related to operating Kubernetes clusters at ultra-large scale, ensuring reliability and efficiency.
- Build and maintain a high-velocity framework that enables continuous, always-on performance and scale testing via a modern CI/CD pipeline.
- Document research, methodologies, and results clearly and concisely, and present findings at internal and external venues, including community conferences such as KubeCon and GTC.
- Engage efficiently with upstream communities, including Kubernetes, CNCF, and NVIDIA open-source projects, to validate performance and scalability of AI workloads early and help shape design and development decisions.

What We Need to See:
- 8+ years of experience in computer architecture, networking, storage systems, and accelerators, plus a Bachelor's or Master's degree in Engineering (preferably Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience.
- Expertise in Kubernetes and familiarity with related CNCF projects.
- Background in working with large-scale parallel and distributed accelerator-based systems.
- Expertise in optimizing performance and AI workloads on large-scale systems.
- Experience with performance modeling and benchmarking at scale.
- Proficiency in Golang/Python.
- Background with the NVIDIA software ecosystem in both training and inference domains.
- Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, etc.).

Ways to Stand Out From the Crowd:
- Strong operational experience with any one of the Kubernetes distributions.
- Prior experience scaling Kubernetes clusters to ultra-large node and object counts.
- Demonstrated history of working in the open-source community.
- Excellent communication and interpersonal abilities.
- PhD in relevant areas.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD to 287,500 USD for Level 4, and 224,000 USD to 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering an inclusive work environment and is proud to be an equal-opportunity employer. We do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

