
Senior Software Engineer - NVLink Rack Scale Stability and Reliability

$152k
Full-time

NVIDIA

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are looking for highly motivated Senior Software Engineers to join our Fabric Networking team with a targeted focus on NVLink Rack-Scale Systems Stability & Reliability. In this role, you will partner closely with architects and developers building our next-generation NVLink and NVSwitch systems, helping transform first-of-their-kind platforms into stable, reliable, and volume production-ready systems. You will work on complex system-level challenges spanning resiliency, diagnostics, recovery, and large-scale AI infrastructure, contributing directly to the software foundation powering next-generation datacenter deployments.

What you will be doing:
  • Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
  • Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
  • Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
  • Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
  • Collaborate with architecture, hardware, firmware, software, and Customer engagement teams to improve system quality and reliability.
  • Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
  • Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.

What we need to see:
  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
  • 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
  • Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
  • Strong system-level debugging across software, firmware, hardware, and networking layers.
  • Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
  • Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
  • Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
  • Strong communication and collaboration skills across engineering, customer, and operations teams.
  • Passion for building reliable next-generation AI infrastructure and solving complex system-level challenges at scale.

Ways to stand out from the crowd:
  • Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters such as NVIDIA GB200 NVL72.
  • Strong understanding of large-scale AI system architecture, including PCIe, memory hierarchy, DMA, high-speed interconnects, and distributed training/inference systems.
  • Experience with server management technologies, data center operations, cluster provisioning, scaling, and fleet monitoring.
  • Proven experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 10, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.

Vacancy posted 2 days ago
