
Senior Reliability Engineer, DGX Cloud

$168k - $270.25k

NVIDIA

Locations: US, CA, Santa Clara; US, Remote
Time type: Full time
Posted on: Posted Today
Job requisition ID: JR2019933

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

Are you passionate about building world-class reliability systems? Join NVIDIA as a Sr. Reliability Engineer, DGX Cloud, and be a pivotal part of a team that is redefining how DGX Cloud approaches reliability. It is an outstanding opportunity to develop strategies and drive innovation. We're looking for a seasoned engineer with experience in running large-scale systems and a deep understanding of operational practices.

**What you'll be doing:**

* Build org-wide reliability strategy, guiding how NVIDIA matures its operational practices in a 24/7 environment.
* Stand up a rigorous SLO program, defining and maintaining high standards across teams.
* Lead incident response for high-severity incidents, ensuring low-drama, high-signal resolution.
* Build and improve production code daily, enhancing our data platform and related tooling.
* Implement chaos engineering, failure injection, and resilience testing to elevate our team's standard practices.
* Improve standards by setting an example with your hands-on experience and leadership.

**What we need to see:**

* Deep, hands-on experience running large-scale production systems with a proven track record.
* A detailed understanding of failure modes in large systems, including cascading dependencies and retry storms.
* Strong software engineering skills with current, hands-on experience in Go, Python, or similar languages.
* Proven experience in establishing and maintaining an SLO program with operational rigor.
* Practical experience in reliability fields such as chaos engineering and failure injection.
* The ability to influence across team boundaries through credibility and expertise.
* 10+ years of industry experience with a Bachelor's or Master's degree, or equivalent experience operating systems at scale.

**Ways to stand out from the crowd:**

* Experience within a world-class reliability function like Google SRE or Meta production engineering.
* Expertise in operating GPU, HPC, or AI training infrastructure with distinctive failure modes.
* A track record of measurable reliability improvements within an organization.
* Proficiency with modern observability and operational tools like Prometheus, OpenTelemetry, Grafana, PagerDuty, and Rootly.

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 270,250 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 26, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 2 days ago
