Principal Software Engineer, Rack-Scale System Software — CSP Engagements

$272k - $431.25k
Full-time

NVIDIA

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for rack-scale system SW/FW, working with CSP engineering teams to ensure they can deploy, monitor, and operate these systems reliably at fleet scale. In this role, you will collaborate with NVIDIA's cross-functional rack-scale system SW/FW engineering teams, providing dedicated CSP-facing technical leadership. Your focus is on the system-level software that manages, monitors, and recovers the rack as a whole: fabric management, GPU/NVSwitch error handling and recovery, health telemetry APIs, firmware update orchestration, and SW-driven serviceability. You will drive work streams with CSP engineering teams to build shared understanding of the architecture, incorporate their operational feedback, and ensure integration readiness.

What you'll be doing:
  • Drive rack-scale SW/FW architecture alignment across CSP engagements, including fabric management software, link health monitoring, GPU/NVSwitch error handling, SW/FW serviceability features (e.g., hot-plug support, component isolation, firmware-driven recovery), and multi-component firmware orchestration.
  • Drive technical work streams with CSP engineering teams on rack-scale system software, ensuring they deeply understand fabric management, NVSwitch behavior, error handling and recovery policies, health telemetry APIs, and SW/FW-controlled recovery operations.
  • Capture and synthesize CSP engineering feedback on rack-scale system software (health monitoring APIs, SW-driven serviceability workflows, firmware update orchestration, and error recovery behavior), and champion that feedback into NVIDIA's architecture decisions.
  • Collaborate with multi-functional teams to ensure customer operational requirements are reflected in system software and firmware development.
  • Identify cross-CSP patterns in rack-scale SW/FW issues, error handling behavior, and system configuration practices, and drive documentation, tooling, and test strategy improvements as a result.
  • Collaborate with execution teams on left-shift strategy, ensuring customer-side SW/FW integration work is identified early and completed ahead of hardware availability.
  • Make critical technical decisions on rack-scale system SW/FW tradeoffs and mitigate execution risks through early engagement with CSP engineering teams.

What we need to see:
  • 15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering.
  • BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience).
  • Deep understanding of rack-scale system software challenges: multi-component coordination, error propagation, health monitoring, and serviceability / reliability.
  • Experience with fabric management software, cluster management, or system-level orchestration frameworks.
  • Familiarity with firmware architectures and update lifecycle management (multi-component update sequencing, rollback, recovery).
  • Understanding of error handling and recovery design patterns in distributed systems: fault isolation, retry policies, graceful degradation.
  • Experience with health monitoring and telemetry systems: health scoring, event correlation, API design for fleet-level observability.
  • Understanding of GPU or accelerator system software (drivers, device management, power management) is a strong plus.
  • Customer obsession: genuine passion for understanding how CSPs operate sophisticated systems at fleet scale and simplifying their experience.
  • Proven success providing technical leadership across organizational boundaries and influencing system software design without direct authority.
  • Strong communication: ability to translate complex system software architecture into actionable guidance for customer engineering teams.

Ways to stand out from the crowd:
  • Experience with NVIDIA NVSwitch, NVOS, or GPU fabric management software.
  • Background in system software for large-scale clusters at a hyperscaler (cluster management, fleet orchestration, health platforms).
  • Experience crafting error handling and recovery frameworks for multi-component systems (hundreds or thousands of coordinating devices).
  • Familiarity with GPU or accelerator fleet operations: driver lifecycle, firmware rollout strategies, health-based scheduling.
  • Understanding of how system software decisions impact serviceability, availability, and operational cost at fleet scale.

NVIDIA's invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as "the AI computing company." We're looking to grow our company and establish teams with the most thoughtful people in the world. Are you ready to change the next generation of computing? Join us at the forefront of technological advancement.

NVIDIA data center systems, such as DGX and HGX, have become core to NVIDIA's rapidly growing enterprise and cloud provider businesses. These platforms bring together the full power of NVIDIA GPUs, NVIDIA NVLink, NVIDIA InfiniBand networking, NVIDIA Grace CPUs, and a fully optimized NVIDIA AI and HPC software stack.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry.

Vacancy posted 2 days ago