Senior Software Engineer, DGX Cloud AI Infrastructure
$184k - $287.5k
NVIDIA
NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.
In this role you will set technical direction across communication libraries, model frameworks, and inference/training stacks to ensure state-of-the-art LLM workloads run efficiently and reliably at scale. You will lead deep performance and reliability investigations on multi-GPU and multi-node deployments, define how we benchmark and qualify new platforms, and build the resilience and failure-attribution capabilities that keep large clusters productive. This is a hands-on senior individual-contributor role for an engineer who operates at the intersection of deep learning systems, GPU performance, distributed computing, and large-scale operations — and who raises the bar for the engineers around them.
What you’ll be doing:
Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
Own root-cause analysis of complex failures in large distributed environments, including hangs, performance regressions, and topology sensitivity.
Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.
What we need to see:
Bachelor’s or Master’s degree in Computer Science or a related technical field (or equivalent experience).
8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
Expertise debugging and triaging AI applications across the full stack — from the application layer down to the hardware.
Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
Proven track record of architecting, debugging, and scaling large-scale distributed systems.
Expert-level Python and C/C++ programming skills.
Experience operating workloads in scheduled, containerized cluster environments.
Excellent analytical, debugging, and communication skills, with the ability to influence across teams.
Ways to stand out from the crowd:
Demonstrated experience debugging and optimizing AI workloads at large scale.
Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric).
Strong knowledge of GPU cluster fabrics and topology, including NVLink, NVSwitch, PCIe, RoCE, and InfiniBand.
Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms.
Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you’re creative, autonomous, and love a challenge, we want to hear from you.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until June 8, 2026. This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

