Senior Software Engineer - NVLink Rack Scale Stability and Reliability
$152k
NVIDIA
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are looking for highly motivated Senior Software Engineers to join our Fabric Networking team with a targeted focus on NVLink Rack-Scale Systems Stability & Reliability. In this role, you will partner closely with architects and developers building our next-generation NVLink and NVSwitch systems, helping transform first-of-their-kind platforms into stable, reliable, and volume production-ready systems. You will work on complex system-level challenges spanning resiliency, diagnostics, recovery, and large-scale AI infrastructure, contributing directly to the software foundation powering next-generation datacenter deployments.

What you will be doing:
- Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
- Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
- Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
- Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
- Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
- Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
- Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.

What we need to see:
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
- 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
- Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
- Strong system-level debugging across software, firmware, hardware, and networking layers.
- Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
- Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
- Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
- Strong communication and collaboration skills across engineering, customer, and operations teams.
- Passion for building reliable next-generation AI infrastructure and solving complex system-level challenges at scale.

Ways to stand out from the crowd:
- Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters such as NVIDIA GB200 NVL72.
- Strong understanding of large-scale AI system architecture, including PCIe, memory hierarchy, DMA, high-speed interconnects, and distributed training/inference systems.
- Experience with server management technologies, data center operations, cluster provisioning, scaling, and fleet monitoring.
- Proven experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 18, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.


