Senior System Architect, Infrastructure Reliability

$184k - $287.5k

NVIDIA

NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: failure attribution at scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.

What you'll be doing:

Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.
Root Cause Automation: Develop heuristics and machine-learning models to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue," reducing the Mean Time to Identify (MTTI) for R&D teams.
Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.

What we need to see:

Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming.
Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways To Stand Out From The Crowd:

Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald), including how the kernel handles hardware exceptions and memory faults.
GPU Infrastructure Proficiency: Deep experience with NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps.
Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 1, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


Vacancy posted 5 days ago

