Senior Reliability Engineer, DGX Cloud

$168k - $270.25k

NVIDIA

# Senior Reliability Engineer, DGX Cloud

* Locations: US, CA, Santa Clara; US, Remote
* Time type: Full time
* Posted on: Posted Today
* Job requisition id: JR2019933

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

Are you passionate about building world-class reliability systems? Join NVIDIA as a Sr. Reliability Engineer, DGX Cloud, and be a pivotal part of a team that redefines operational excellence. Our team is at the forefront of redefining how DGX Cloud approaches reliability, making it an outstanding opportunity to develop strategies and drive innovation. We're looking for a seasoned engineer with experience in running large-scale systems and a deep understanding of operational practices.

**What you'll be doing:**

* Build org-wide reliability strategy, guiding how NVIDIA matures its operational practices in a 24/7 environment.
* Stand up a rigorous SLO program, defining and maintaining high standards across teams.
* Lead incident response for high severity incidents, ensuring low drama and high signal resolution.
* Build and improve production code daily, enhancing our data platform and related tooling.
* Implement chaos engineering, failure injection, and resilience testing to elevate our team's standard practices.
* Improve standards by setting an example with your hands-on experience and leadership.

**What we need to see:**

* Deep, hands-on experience running large-scale production systems with a proven track record.
* A detailed understanding of failure modes in large systems, including cascading dependencies and retry storms.
* Strong software engineering skills with current, hands-on experience in Go, Python, or similar languages.
* Proven experience in establishing and maintaining an SLO program with operational rigor.
* Practical experience in reliability fields such as chaos engineering and failure injection.
* The ability to influence across team boundaries through credibility and expertise.
* 10+ years of industry experience with a Bachelor's or Master's degree, or equivalent experience operating systems at scale.

**Ways to stand out from the crowd:**

* Experience within a world-class reliability function like Google SRE or Meta production engineering.
* Expertise in operating GPU, HPC, or AI training infrastructure with outstanding failure modes.
* A track record of measurable reliability improvements within an organization.
* Proficiency with modern observability and operational tools like Prometheus, OpenTelemetry, Grafana, PagerDuty, and Rootly.

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 270,250 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 26, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 2 days ago

Similar jobs that could be interesting for you

Based on the Senior Reliability Engineer, DGX Cloud in Santa Clara, CA vacancy

DGX Cloud Senior Reliability Engineer (Remote)
NVIDIA Corporation in Santa Clara is seeking a Senior Reliability Engineer, DGX Cloud, to build and enhance reliability strategies for large-scale systems. You will lead efforts to implement SLO programs, improve operational practices, and ensure system resilience. The...
Senior
Remote job
NVIDIA Corporation
Santa Clara, CA
2 days ago
Senior AI Infra Engineer - Large-Scale DGX Cloud (Equity)
$356.5k
NVIDIA Gruppe is seeking an experienced AI infrastructure software engineer to join its DGX Cloud AI Efficiency Team in Santa Clara, California. This role focuses on developing the infrastructure for optimizing AI workloads and ensuring high availability and efficiency...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Director, Site Reliability and Software Engineering - DGX Cloud
$320k
As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi... ...infrastructure, and modern practices in the NVIDIA DGX Cloud Computing environment. Drive technical...
Suggested
NVIDIA Gruppe
Santa Clara, CA
3 days ago
Senior Software Engineer, Attestation Services - DGX Cloud
$224k - $356.5k
...lasting impact on the world. As part of the DGX Cloud organization, the Attestation Services... ...with security, silicon, and cloud engineering teams to turn embedded hardware trust mechanisms... ...and attestation standards into reliable, self-service cloud capabilities. If you...
Senior
Remote work
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior Software Engineer, DGX Cloud Production Engineering
$184k - $287.5k
Overview NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for... ...workloads. We are looking for Senior Software Engineers to help build the automation, tooling... ...operational systems that make GPU clusters reliable, scalable, and safe to run. This role...
Senior
NVIDIA Gruppe
Santa Clara, CA
3 days ago
Senior SRE/DevOps Engineer - Cloud Reliability & Automation
donato technologies is seeking a Senior SRE / DevOps Engineer in Sunnyvale, CA. The successful candidate will focus on ensuring system reliability and scalability while automating operations across all teams. Candidates should have over 8 years of experience in DevOps,...
Senior
donato technologies
Sunnyvale, CA
2 days ago
Senior AI Infrastructure Engineer - DGX Cloud
...looking for an outstanding, passionate, and dedicated Senior AI Infrastructure Engineer to join our DGX Cloud group. This engineering role will design, build... ...ensures our GPU cloud services deliver maximum reliability and uptime. They carefully prepare and plan changes...
Senior
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior GPU Fleet & DGX Cloud Automation Architect
$320k
A leading tech company is seeking a seasoned individual to spearhead DGX Cloud strategy, focusing on GPU lifecycle and operational health. The ideal candidate will have over 15 years in technical roles, with significant experience in cloud infrastructure and leadership....
Senior
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior AI Infra Engineer - DGX Cloud & Kubernetes, Equity
$152k - $241.5k
NVIDIA Corporation is seeking a Senior AI Infrastructure Engineer in Santa Clara, California. This role involves designing, building, and maintaining large-scale production systems for AI services. Applicants should have a BS degree and at least 5 years of relevant experience...
Senior
NVIDIA Corporation
Santa Clara, CA
21 hours ago
Senior Network Engineer - DGX Cloud
$168k - $264.5k
Senior Network Engineer - Cloud Network Infrastructure. NVIDIA is seeking an experienced Senior Network Engineer to develop and manage a robust cloud network infrastructure that supports NVIDIA's software development workflows and tools. The role focuses on designing, implementing...
Senior
NVIDIA Gruppe
Santa Clara, CA
5 days ago
Senior Systems Engineer, Storage - DGX Cloud
$208k - $333.5k
Systems Engineering is an engineering discipline focused on building, automating, and operating... ...systems with high efficiency, reliability, and velocity. It combines software and... ...that our internal and external facing GPU cloud services are deployed reliably, observable...
Senior
Flexible hours
Nvidia Corporation
Santa Clara, CA
2 days ago
Senior Simulation & Virtualization Engineer - DGX Platform
$152k - $241.5k
NVIDIA Corporation in Santa Clara is seeking a Sr. Software Engineer to architect a simulation platform for next-generation DGX products. The role involves enhancing simulator components and collaborating with global teams on performance improvements and bug fixes. The...
Senior
NVIDIA Corporation
Santa Clara, CA
21 hours ago
Senior Software Engineer, DGX Cloud AI Infrastructure
...advanced large language model workloads. We are looking for a Senior Software Engineer to lead the bring‑up, triage, benchmarking, analysis, and... ...ensure state‑of‑the‑art LLM workloads run efficiently and reliably at scale. You will lead deep performance and reliability investigations...
Senior
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior Cloud Reliability Engineer for AI Infra
$156k - $190k
Crusoe Energy Systems in Sunnyvale, CA, is seeking a Staff Cloud Support Engineer to provide technical leadership in cloud infrastructure. You will lead incident responses, design reliability architecture, and mentor team members. The ideal candidate will have over 8 years...
Senior
Crusoe Energy Systems
Sunnyvale, CA
1 day ago
Senior Site Reliability Engineer - Cloud AI Infrastructure
Cerebras is looking for a Senior Site Reliability Engineer to join their Infrastructure team in Palo Alto, California. This role involves designing... ...execution. Ideal candidates have a strong background in cloud-native technologies and distributed systems. The position...
Senior
Cerebras
Palo Alto, CA
4 days ago
Senior Site Reliability Engineer: Cloud, Kubernetes & CI/CD
A leading tech recruiting firm is seeking a Site Reliability Engineer to manage and optimize cloud infrastructure primarily using GCP or AWS. The role involves maintaining high availability through Kubernetes clusters and improving CI/CD pipelines with Terraform. Ideal...
Senior
Amiri Recruiting
Mountain View, CA
3 days ago
Senior Java SRE & Platform Engineer - AWS/Kubernetes
A leading technology company is looking for a Java SRE Engineer to support large-scale cloud migrations and production systems on AWS and Kubernetes... ...members and collaborating with various teams to ensure reliability. This position is onsite in the San Francisco Bay Area....
Senior
EITACIES Inc.
Santa Clara, CA
3 days ago
Senior DGX Cloud AI Infrastructure Software Engineer
Joining NVIDIA's DGX Cloud AI Efficiency Team means contributing to... ...an AI infrastructure software engineer to join our team. You'll be instrumental... ...of AI systems. As a senior DGX Cloud AI Infrastructure... ...Define meaningful and actionable reliability metrics to track and improve...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior Storage Production Engineer - DGX Cloud
$176k - $276k
Production engineering is a field that involves crafting, building, and maintaining large-scale... ...and deployment, along with open-source cloud-enabling technologies such as Kubernetes... ...ensuring storage architectures are reliable, scalable, and efficient. They optimize...
Senior
Flexible hours
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Senior Site Reliability Engineer - Observability and Telemetry Platform
$176k - $276k
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high... ...management, continuous delivery and deployment and open source cloud enabling technologies like Kubernetes and OpenStack. SRE at...
Senior
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior Software Engineer - DGX Cloud Services and Software
$168k - $270.25k
The NVIDIA DSX organization is looking for software engineering talent to build NVIDIA’s NICo technology. This software assists in the rapid... ...and support end-to-end software solutions to manage complex cloud infrastructure deployments. You will write services and software...
Senior
Full time
NVIDIA
Santa Clara, CA
3 days ago
Senior Technical Program Manager, DGX Cloud Software Products and Services
$168k - $258.75k
...programs emphasizing resilience, reliability, and goodput. This role requires... ...research. You will work closely with engineering, SRE, operations, and researchers...
Senior
NVIDIA
Santa Clara, CA
2 days ago
Senior Technical Program Manager - DGX Cloud Storage
$200k - $322k
NVIDIA’s DGX Cloud is redefining how organizations deploy and scale AI infrastructure. We’re looking for a Senior Technical Program Manager to drive storage‑related initiatives across... ...a high‑impact role interfacing with engineering, product, operations, finance, and our...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Principal Software Engineer, DGX Cloud Production Engineering
$272k - $431.25k
NVIDIA DGX Cloud is scaling GPU infrastructure across internal, partner... ...for Principal Software Engineers to help shape the technical direction... ...operations, automation, and reliability across large‑scale GPU clusters. This role is for senior technical leaders who can define...
NVIDIA Gruppe
Santa Clara, CA
3 days ago
Senior Reliability Engineer
$120k - $171k
...Photonic Devices Engineer This position is cross-functional in nature and requires close cooperation with the design, development... ...photonic devices and have a strong desire to learn about the reliability challenges associated with new product development. Your Responsibilities...
Senior
Full time
Temporary work
Nokia
Sunnyvale, CA
1 day ago
Senior Reliability Engineer
$116k - $184k
...profoundly impacting society. Come join the team and help build the next era of computing! We're seeking an outstanding Senior HTOL Reliability Engineer to join our Santa Clara lab. This role requires deep device-circuitry knowledge and hands-on hardware development. You...
Senior
NVIDIA
Santa Clara, CA
2 days ago
Senior Reliability Engineer
$119.8k - $234.7k
...Overview Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft... .... We are looking for a Senior Quality Engineer to join the team... ...Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering...
Senior
Ongoing contract
Work experience placement
Work at office
Local area
Worldwide
Microsoft Corporation
Santa Clara, CA
1 day ago
Senior Technical Program Manager - DGX Cloud Infra Security
$200k - $322k
...cross-functional teams in Security, Compliance, SRE, and Engineering to continually advance and strengthen the DGX Cloud...
Senior
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior IC Packaging Reliability Engineer - 2.5D/3D & BGA
NVIDIA Gruppe is seeking an experienced professional to lead package-level reliability for semiconductor products in Santa Clara, California. The ideal candidate will possess a Master’s or PhD in a related field, along with 8+ years of hands-on experience in IC packaging...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior Reliability Engineer - LPU Packaging
What You’ll Be Doing: Own the package‑level reliability spec for assigned products. Define... ...What We Need to See: MS/PhD in Electrical Engineering, Materials Science, Mechanical Engineering... ...operations. Experience with data center or cloud hardware and understanding of rack and...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
