Senior System Architect, Infrastructure Reliability

$184k - $287.5k

NVIDIA

NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to develop and build an automated framework that ingests telemetry from CPU and GPU clusters, identifies the root cause of job failures in real time, and distinguishes between hardware faults, infrastructure instability, and software defects.

What you'll be doing:

Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.
Root Cause Automation: Develop heuristics and models based on machine learning to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue." This reduces the Mean Time to Identify (MTTI) for R&D teams.
Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.

What we need to see:

Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming.
Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways To Stand Out From The Crowd:

Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
GPU Infrastructure Proficiency: Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps.
Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 19, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 3 days ago
