Senior System Architect, Infrastructure Reliability

$184k - $287.5k

NVIDIA Gruppe

Senior System Architect: Heterogeneous EDA Systems

NVIDIA is seeking an engineer to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA or equivalent workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. The role is to develop and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real‑time, distinguishing between hardware faults, infrastructure instability, and software defects.

What you’ll be doing

  • Architect Failure Attribution Frameworks: Build a scalable 'flight recorder' for EDA jobs that captures high‑fidelity state across the CPU, GPU, and Fabric at the moment of failure.
  • Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system‑level events such as OOM kills or NUMA‑related hangs.
  • Distributed Logging & Tracing: Implement low‑overhead tracing mechanisms (using tracing tools or custom agents) that provide access to job execution across multi‑node Slurm or Kubernetes clusters.
  • Root Cause Automation: Develop heuristics and models based on machine learning to classify failures as 'Hardware Fault,' 'Software Bug,' or 'Environment Issue.' This reduces the Mean Time to Identify (MTTI) for R&D teams.
  • Resiliency Engineering: Work closely with hardware and infrastructure teams to define 'signals of impending failure,' enabling proactive job migration or checkpointing before a crash occurs.

What we need to see

  • Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA pipelines for HPC or cloud‑scale environments.
  • CPU Architecture Deep‑Dive: Expert knowledge of x86/ARM node‑level metrics: IPC, cache contention, NUMA imbalance, and hardware interrupts.
  • Programming Proficiency: Strong C++ and Python skills, with the ability to build high‑performance daemons that monitor system health without impacting workload performance.
  • Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways to Stand Out from the Crowd

  • Low‑Level Diagnostics: Expert knowledge of the Linux kernel and its error‑reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
  • GPU Infrastructure Proficiency: Deep experience with the NVIDIA DCGM and NVIDIA Management Library (NVML) for monitoring device health and capturing state‑dumps.
  • Experience with tools doing non‑intrusive monitoring of application health and syscall‑level failure patterns.
  • Experience with checkpoint/restore technologies (like CRIU) and their application in long‑running EDA flows.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering a diverse work environment and proud to be an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 1 day ago
