Principal Software Engineer, At-Scale Reliability and Fleet Intelligence — CSP Engagements

$272k - $431.25k

NVIDIA Gruppe

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for fleet-scale reliability, working directly with engineering teams of key CSP / hyperscale customers to ensure NVIDIA platforms achieve target MTBI (Mean Time Between Interruptions) in production. In this role, you will augment NVIDIA's internal software/firmware and quality teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build shared understanding of reliability software/firmware architecture and methodology, incorporate their fleet telemetry and failure data into NVIDIA's improvement priorities, and validate that reliability improvements measured in the lab translate to real customer environments. Your cross-CSP visibility enables you to distinguish systemic architectural gaps from environmental or configuration-specific issues that no single customer engagement could identify alone.

What you'll be doing:
  • Drive reliability work streams with CSP engineering teams, ensuring shared understanding of MTBI measurement methodology, failure classification, and health monitoring architecture
  • Gather and synthesize CSP fleet reliability data: identify failure patterns that appear across multiple customers and champion improvements back into NVIDIA's firmware, driver, and hardware teams
  • Define consistent MTBI measurement methodology that works across different CSP monitoring environments and operational practices
  • Conduct fleet-scale failure pattern analysis using statistical methods (Pareto, survival analysis, Weibull) to classify failures as systemic, environmental, or configuration-specific
  • Drive fleet health monitoring integration architecture: ensure NVIDIA's health agents, telemetry, and reporting align with CSP operational workflows and automation
  • Define burn-in reliability test environment and cluster certification criteria in collaboration with quality teams, validating with customers that criteria are meaningful
  • Collaborate with CSPs to ensure reliability-related integration work (health monitoring deployment, telemetry pipeline, alerting configuration) is complete ahead of at-scale launch
  • Develop predictive failure models using fleet telemetry and validate their effectiveness in customer environments

What we need to see:
  • 15+ years of experience in systems software at datacenter scale, or reliability engineering with focus on at-scale challenges
  • BS or MS in Computer Science, Electrical Engineering, Statistics, or related field (or equivalent experience)
  • Deep expertise in multi-NUMA, rack-scale system software and firmware
  • Statistical failure analysis methods: MTBF/MTBI calculation, Pareto analysis, root cause classification
  • Experience with fleet-level telemetry and observability systems: time-series databases, anomaly detection, health scoring, event correlation
  • Understanding of hardware failure modes in large-scale GPU/accelerator deployments, with the ability to classify and prioritize across compute, interconnect, memory, power, and thermal domains
  • Experience defining or operating burn-in, stress testing, or certification frameworks for complex hardware systems
  • Familiarity with predictive maintenance or anomaly detection approaches applied to fleet health data
  • Customer obsession: genuine passion for understanding fleet reliability challenges at scale and translating them into actionable engineering priorities
  • Strong communication: ability to present statistical reliability findings to both deep technical audiences and executive leadership
  • Demonstrated success driving cross-functional improvements across hardware, firmware, and software teams without direct authority

Ways to stand out from the crowd:
  • Experience in fleet reliability at a hyperscaler (hardware health, fleet reliability at leading CSP/Hyperscaler)
  • Familiarity with NVIDIA GPU error taxonomy (Xid errors, NVLink error counters, thermal events, CPER records)
  • Experience building health scoring or predictive failure models for accelerator or HPC infrastructure
  • Background in defining MTBI/MTBF measurement standards or certification programs for complex multi-component systems
  • Understanding of how reliability data flows from device firmware through telemetry pipelines to fleet-level dashboards and automated remediation

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, hardworking and self-motivated, we want to hear from you!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026. This posting is for an existing vacancy.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 1 day ago