Senior System Architect, Infrastructure Reliability

$184k - $287.5k

NVIDIA

Senior System Architect: Heterogeneous EDA Systems

NVIDIA is seeking a Senior System Architect to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to develop and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time. The framework will distinguish between hardware faults, infrastructure instability, and software defects.

What You'll Be Doing:

  • Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
  • Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
  • Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide access to job execution across multi-node Slurm or Kubernetes clusters.
  • Root Cause Automation: Develop heuristics and machine-learning models to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue," reducing the Mean Time to Identify (MTTI) for R&D teams.
  • Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.

What We Need To See:

  • Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming.
  • Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
  • CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
  • Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
  • Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways To Stand Out From The Crowd:

  • Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
  • GPU Infrastructure Proficiency: Deep experience with the NVIDIA DCGM (Data Center GPU Manager) and NVIDIA Management Library (NVML) for monitoring device health and capturing state-dumps.
  • Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
  • Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 19, 2026.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 2 days ago