Senior Software Engineer - NVLink Rack Scale Stability and Reliability

$152k
Full-time

NVIDIA

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are looking for highly motivated Senior Software Engineers to join our Fabric Networking team with a targeted focus on NVLink Rack-Scale Systems Stability & Reliability. In this role, you will partner closely with architects and developers building our next-generation NVLink and NVSwitch systems, helping transform first-of-their-kind platforms into stable, reliable, and volume production-ready systems. You will work on complex system-level challenges spanning resiliency, diagnostics, recovery, and large-scale AI infrastructure, contributing directly to the software foundation powering next-generation datacenter deployments.

What you will be doing:
  • Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
  • Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
  • Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
  • Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
  • Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
  • Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
  • Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.

What we need to see:
  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
  • 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
  • Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
  • Strong system-level debugging across software, firmware, hardware, and networking layers.
  • Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
  • Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
  • Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
  • Strong communication and collaboration skills across engineering, customer, and operations teams.
  • Passion for building reliable next-generation AI infrastructure and solving complex system-level challenges at scale.

Ways to stand out from the crowd:
  • Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters such as NVIDIA GB200 NVL72.
  • Strong understanding of large-scale AI system architecture, including PCIe, memory hierarchy, DMA, high-speed interconnects, and distributed training/inference systems.
  • Experience with server management technologies, data center operations, cluster provisioning, scaling, and fleet monitoring.
  • Proven experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 18, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.

Vacancy posted 4 days ago