Senior System Architect, Infrastructure Reliability
$184k - $287.5k
NVIDIA
Senior System Architect: Heterogeneous EDA Systems
NVIDIA is seeking a Senior System Architect to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.
What You'll Be Doing:
- Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
- Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
- Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.
- Root Cause Automation: Develop heuristics and machine-learning models to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue," reducing the Mean Time to Identify (MTTI) for R&D teams.
- Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.
What We Need To See:
- Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming.
- Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
- CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
- Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
- Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.
Ways To Stand Out From The Crowd:
- Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
- GPU Infrastructure Proficiency: Deep experience with the NVIDIA DCGM (Data Center GPU Manager) and NVIDIA Management Library (NVML) for monitoring device health and capturing state-dumps.
- Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
- Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until June 19, 2026.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

