Senior System Architect, Infrastructure Reliability

NVIDIA

NVIDIA is seeking a Senior System Architect: Heterogeneous EDA Systems to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.

What you’ll be doing:

Architect Failure Attribution Frameworks: Build a scalable ‘flight recorder’ for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.

Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.

Root Cause Automation: Develop heuristics and machine-learning models to classify failures as ‘Hardware Fault’, ‘Software Bug’, or ‘Environment Issue’, reducing the Mean Time to Identify (MTTI) for R&D teams.

Resiliency Engineering: Work closely with hardware and infrastructure teams to define ‘signals of impending failure’, enabling proactive job migration or checkpointing before a crash occurs.

What we need to see:

Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.

CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.

Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.

Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways to stand out from the crowd:

Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald), including how the kernel handles hardware exceptions and memory faults.

GPU Infrastructure Proficiency: Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps.

Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.

Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

Compensation & Benefits

Base salary will be determined based on location, experience, and comparable positions: $184,000–$287,500 for Level 4, and $224,000–$356,500 for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering an inclusive work environment and is proud to be an equal-opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Vacancy posted 5 days ago
