Senior System Architect, Infrastructure Reliability

$184k - $287.5k

NVIDIA Gruppe

Senior System Architect: Heterogeneous EDA Systems

NVIDIA is seeking an engineer to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA or equivalent workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. The role is to develop and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.

What you’ll be doing
  • Architect Failure Attribution Frameworks: Build a scalable 'flight recorder' for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure. Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
  • Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide access to job execution across multi-node Slurm or Kubernetes clusters.
  • Root Cause Automation: Develop heuristics and models based on machine learning to classify failures as 'Hardware Fault,' 'Software Bug,' or 'Environment Issue.' This reduces the Mean Time to Identify (MTTI) for R&D teams.
  • Resiliency Engineering: Work closely with hardware and infrastructure teams to define 'signals of impending failure,' enabling proactive job migration or checkpointing before a crash occurs.

What we need to see
  • Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA pipelines for HPC or cloud-scale environments.
  • CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC, cache contention, NUMA imbalance, and hardware interrupts.
  • Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
  • Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways to Stand Out from the Crowd
  • Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
  • GPU Infrastructure Proficiency: Deep experience with the NVIDIA DCGM and NVIDIA Management Library (NVML) for monitoring device health and capturing state-dumps.
  • Experience with tools doing non-intrusive monitoring of application health and syscall-level failure patterns.
  • Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal-opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 4 days ago