Senior System Architect, Infrastructure Reliability
NVIDIA
$184k - $287.5k
NVIDIA is seeking a Senior System Architect: Heterogeneous EDA Systems to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time. The framework will distinguish between hardware faults, infrastructure instability, and software defects.

What you'll be doing:

- Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure. Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions with system-level events such as OOM kills or NUMA-related hangs.
- Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.
- Root Cause Automation: Develop heuristics and machine-learning models to classify failures as "Hardware Fault," "Software Bug," or "Environment Issue," reducing the Mean Time to Identify (MTTI) for R&D teams.
- Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.

What we need to see:

- Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
- CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
- Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.
- Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways To Stand Out From The Crowd:

- Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald), including how the kernel handles hardware exceptions and memory faults.
- GPU Infrastructure Proficiency: Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps.
- Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
- Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 19, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry.




