Senior Software Engineer, DGX Cloud AI Infrastructure

$184k - $287.5k

NVIDIA

NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.

In this role you will set technical direction across communication libraries, model frameworks, and inference/training stacks to ensure state-of-the-art LLM workloads run efficiently and reliably at scale. You will lead deep performance and reliability investigations on multi-GPU and multi-node deployments, define how we benchmark and qualify new platforms, and build the resilience and failure-attribution capabilities that keep large clusters productive. This is a hands-on senior individual-contributor role for an engineer who operates at the intersection of deep learning systems, GPU performance, distributed computing, and large-scale operations — and who raises the bar for the engineers around them.

What you’ll be doing:

  • Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.

  • Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.

  • Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.

  • Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.

  • Own root-cause analysis of complex failures in large distributed environments, including hangs, performance regressions, and topology sensitivity.

  • Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.

  • Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows for new platforms.

  • Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.

  • Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.

  • Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

What we need to see:

  • Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).

  • 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.

  • Expertise debugging and triaging AI applications across the full stack — from the application layer down to the hardware.

  • Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.

  • Proven track record of architecting, debugging, and scaling large-scale distributed systems.

  • Expert-level Python and C/C++ programming skills.

  • Experience operating workloads in scheduled, containerized cluster environments.

  • Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

Ways to stand out from the crowd:

  • Demonstrated experience debugging and optimizing AI workloads at large scale.

  • Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric).

  • Strong knowledge of GPU cluster fabrics and topology, including NVLink, NVSwitch, PCIe, RoCE, and InfiniBand.

  • Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms.

  • Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you’re creative, autonomous, and love a challenge, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 8, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Vacancy posted 12 hours ago