Principal Software Engineer - Rack Scale Systems Infrastructure

$272k - $431.25k

NVIDIA

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

At NVIDIA, as a Principal Rack Scale Systems Infrastructure Engineer, you will build and guide the development of the software systems that support our upcoming rack-scale infrastructure products and services. This role sits where software meets hardware: you will work on control planes, state machines, orchestration systems, firmware, OS lifecycle, and networking fabrics. Your task is to build infrastructure-as-a-service control plane software that turns complex rack-scale hardware into dependable, manageable, and programmable infrastructure for NVIDIA, its partners, and leading cloud and enterprise customers worldwide.

What You Will Be Doing:

Define the complete software architecture for rack-scale infrastructure products and services, covering control plane services, infrastructure management, firmware, operating systems, kernel drivers, networking fabrics, accelerator software, and user‑mode manageability software.
Use Kubernetes and cloud‑native primitives as an infrastructure fabric when appropriate, including controllers, operators, reconciliation loops, and open source components.
Build open source infrastructure software that can be adopted in different forms, including libraries, services, controllers, operators, and integration APIs for internal deployments and CSP environments.
Bridge hardware and software teams across firmware, BMC, BIOS, boot flows, OS images, drivers, networking, NVLink domains, InfiniBand, GPUs, DPUs, CPUs, and system management interfaces.
Translate forward‑looking infrastructure roadmaps into formal software requirements, architecture specifications, and execution plans that align teams across the organization.
Partner directly with hyperscalers, CSPs, enterprise customers, internal component leads, vendors, and business partners to align infrastructure capabilities with real‑world deployment and integration needs.
Establish reliability, security, validation, and left‑shift strategies that reduce risk before hardware reaches production environments.
Mentor senior engineers and technical leads, raising the engineering bar for large‑scale networked systems, foundational software, and rack‑scale control plane development.
Make high‑quality technical decisions in ambiguous environments, balancing customer needs, schedule, hardware realities, software maintainability, open source adoption, and long‑term infrastructure evolution.

What We Need To See:

BS or MS in Computer Engineering, Computer Science, Electrical Engineering, or a related field, or equivalent experience.
Proven experience (15+ years) in systems architecture, system software, distributed systems, infrastructure control planes, or infrastructure engineering.
Solid architectural knowledge of coordination frameworks, state machines, declarative APIs, reconciliation loops, lifecycle orchestration, failure handling, upgrade and rollback workflows, and distributed systems tradeoffs.
Practical coding skills in Go, C++, or Rust, including the capability to write, review, and direct production‑quality infrastructure software. Experience with Rust is highly valued.
Experience with Kubernetes or similar orchestration systems, especially as a fabric for managing infrastructure, hardware resources, or large‑scale infrastructure services.
Experience with Linux‑based infrastructure software, OS rollout and image management, kernel or driver interactions, firmware lifecycle, and hardware bring‑up workflows.
Strong understanding of data center networking technologies and protocols, such as Ethernet, InfiniBand, RDMA, and fabric‑level manageability.
Experience with complex accelerator‑based systems, including GPUs, DPUs, FPGAs, custom silicon, or other high‑performance computing systems.
Expertise in in‑band and out‑of‑band management architectures, including BMCs, Redfish, IPMI, and related system management protocols.
Ability to work with security experts to define practical tradeoffs across secure boot, attestation, access control, update safety, serviceability, and ease of operation.
Experience crafting software intended for open source release, including API stability, modularity, documentation, community usability, and clean separation between shared software and deployment‑specific integrations.
Experience using AI‑assisted development tools responsibly as an engineering multiplier for coding, test generation, debugging, build iteration, and documentation.
Established skill in specifying requirements, guiding architecture, and managing delivery across various engineering teams and organizations.
Strong written and verbal communication skills, enabling clear explanation of complex hardware/software tradeoffs to engineering leaders, customers, partners, and executives.

Ways To Stand Out from the Crowd:

Built software supporting multiple adoption models: internal services, CSP‑integrated offerings, reusable libraries, and customer‑extensible APIs.
Strong Rust skills in systems, infrastructure, or hardware‑adjacent software.
Multiplied team impact through reference implementations, design reviews, shared libraries, architecture docs, dev workflows, and AI‑assisted engineering.
Hands‑on with fleet‑scale provisioning, updates, rollback, observability, health, and remediation.
Led across the full data center product lifecycle: inception, pre‑ and post‑silicon, manufacturing, deployment, and operations.
Familiar with open source ecosystems, contribution models, and balancing community collaboration with product needs.
Deep experience with rack‑ or cluster‑scale systems spanning compute, networking, storage, accelerators, firmware, and infra management as one operational domain.
Skilled at finding simple, durable abstractions in complex systems to align teams, customers, and long‑term direction.

Compensation and Benefits:

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD to 431,250 USD. You will also be eligible for equity and benefits.

Equal Employment Opportunity:

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We highly value diversity in our current and future employees, and we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Vacancy posted 4 days ago
