Senior System Architect, Infrastructure Reliability

$184k - $287.5k
Full-time

NVIDIA

NVIDIA is seeking a Senior System Architect: Heterogeneous EDA Systems to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to develop and build an automated framework. This framework will ingest telemetry from CPU and GPU clusters to identify the root cause of job failures in real time. It will distinguish between hardware faults, infrastructure instability, and software defects.

What you'll be doing:
  • Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure. Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions, and connect these errors with system-level events such as OOM kills or NUMA-related hangs.
  • Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide access to job execution across multi-node Slurm or Kubernetes clusters.
  • Root Cause Automation: Develop heuristics and machine-learning models to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue." This reduces the Mean Time to Identify (MTTI) for R&D teams.
  • Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.

What we need to see:
  • Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
  • CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
  • Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
  • Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways To Stand Out From The Crowd:
  • Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
  • GPU Infrastructure Proficiency: Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps.
  • Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
  • Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 19, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer.
As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry.

Vacancy posted 3 hours ago