Director, Site Reliability and Software Engineering - DGX Cloud

$320k

NVIDIA Corporation

Locations: US, CA, Santa Clara; US, Remote
Time type: Full time
Posted on: Posted Today
Job requisition ID: JR2017420

NVIDIA's invention of the GPU ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to grow our company and our teams with the smartest people in the world. We are looking for you.

NVIDIA's GPUs are widely used for deep learning, which the research community and industry rely on to help solve many big data problems such as computer vision, speech recognition and translation, life science, image recognition, and natural language processing. NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges. In this environment, the NVIDIA GPU Cloud computing team is looking for leaders to work on a world-class deep learning platform.

**What you'll be doing:**

As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy. You will lead all aspects of cluster automation and operational excellence planning, and you will grow your team. You thrive in a fast-paced, iterative engineering environment and have experience delivering scalable distributed systems. Most importantly, you will have a track record of earning the respect of past teams and cross-functional partners as both a technical leader and a manager, and you are able to work through influence rather than direct authority when needed.
The NVIDIA GPU Cloud Computing team works with customers across the entire company, and the ability to work across multiple levels of technical and organizational leadership is critical. Operating with scale and speed, our world-class software engineers are just getting started, and as a leader, you will guide the way to solving reliability for both our internally critical and our externally visible systems.

* Manage a team of Software and Site Reliability engineers, including program development, task planning, and code reviews.
* Define team strategy and roadmap, and drive adoption of scalable SDLC practices, test infrastructure, and modern practices in NVIDIA's DGX Cloud Computing environment.
* Drive technical projects and provide leadership in an innovative and fast-paced environment.
* Be responsible for the overall planning, tracking, and success of technical projects.
* Work closely with project and product management teams to ensure best-in-class product development.
* Contribute technically to the technical projects for DGX Cloud Computing Services.
* Interact with key internal stakeholders to provide operational and financial clarity on technical spend.
* Drive decision making, visibility, and operational rigor across business analytics initiatives such as budget and project and portfolio reporting. Lead efforts related to executive reporting, dashboards, and operational CTO metrics, focusing on continuous improvement and evolution to maximize decision making and executive visibility.

**What we need to see:**

* 12+ years of overall experience in engineering management. 5+ years of leadership.
* Bachelor's or Master's degree in Computer Science, or equivalent experience.
* Experience in designing and implementing large-scale distributed systems.
* Experience in containers, virtualization environments, and cluster solutions.
* Experience in managing Technical Support / DevOps teams.
* Set an appropriately high bar for technical excellence and deliver projects on tight deadlines.
* Strong knowledge of Unix/Linux.
* Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockages.
* Demonstrated people management and leadership skills, with a proven track record of mentoring and coaching team members.
* Ability to quickly learn and evaluate new technologies.
* Ability to influence and establish relationships with other software and IT functional groups such as development, server, storage, and security teams.

We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our exclusive engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you!

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 320,000 USD - 488,750 USD for Level 5, and 384,000 USD - 575,000 USD for Level 6.

You will also be eligible for equity and .

Applications for this job will be accepted at least until May 8, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 15 hours ago

Similar jobs that could be interesting for you
Based on the Director, Site Reliability and Software Engineering - DGX Cloud in Santa Clara, CA vacancy

Site Reliability Engineering Manager, Google Distributed Cloud
$207k - $300k
Site Reliability Engineering Manager, Google Distributed Cloud Google Sunnyvale, CA, USA Bachelor’s degree in Computer Science, a related field, or equivalent practical... ...job Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-...
Website
Full time
Google Inc.
Sunnyvale, CA
3 days ago
Senior Technical Program Manager, DGX Cloud Software - Product and Services
$200k - $322k
...NVIDIA's DGX Cloud (DGXC) powers AI for strategic research and product... ...NVIDIA’s next-generation AI software platforms. In this role, you... ...is on enabling scalable, reliable, and supportable software... ...responsible for managing high-impact engineering programs within a dynamic,...
Suggested
NVIDIA
Santa Clara, CA
7 days ago
Senior Technical Program Manager, DGX Cloud Software Products and Service
$200k - $322k
...DGX Cloud Team is looking for a Senior Technical Program Manager... ...This position involves leading software-related initiatives across... ...responsible for managing high-impact engineering programs within a dynamic,... ...automation, and service reliability practices in cloud...
Suggested
Shift work
NVIDIA
Santa Clara, CA
4 days ago
Senior Technical Program Manager, DGX Cloud Software Products and Services
$168k - $258.75k
...Technical Program Manager, DGX Cloud Software Products and Services page... ...programs emphasizing resilience, reliability, and goodput. This role... ...You will work closely with engineering, SRE, operations, and... ...infrastructure, platform, site reliability, operational, and...
Website
NVIDIA
Santa Clara, CA
2 days ago
Staff Engineer - Hybrid Cloud Reliability & SRE
$184.12k - $275.45k
General Motors is looking for a Staff Engineer in Sunnyvale to join the Hybrid Services & Reliability (HSR) team. Responsibilities include leading SLOs for cloud services, automating server... ...will have extensive experience in Site Reliability Engineering and Linux systems...
Website
Work at office
General Motors
Sunnyvale, CA
1 day ago
Director, SRE & Reliability for AI Knowledge Platform
$250k
...management solutions provider in Sunnyvale is looking for a Director of Site Reliability Engineering. The ideal candidate will lead a team ensuring the... ...Excellent leadership skills and over 10 years of experience in software operations, including SRE leadership, are essential. A...
Website
eGain
Sunnyvale, CA
2 days ago
Senior Technical Program Manager, DGX Cloud - Trust Services
$200k - $322k
...Manager to lead Trust Services programs for DGX Cloud. DGX Cloud powers large-scale AI... ...security, product security, compliance, engineering execution, and partner readiness. This... ...standards across firmware, platform, and software teams. Establish program structure through...
NVIDIA
Santa Clara, CA
7 days ago
Senior Site Reliability Engineer - Observability and Telemetry Platform
$176k - $276k
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain... ...availability using the combination of software and systems engineering practices.... ...delivery and deployment and open source cloud enabling technologies like Kubernetes...
Website
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior AI Infrastructure Engineer - DGX Cloud
$152k - $241.5k
...dedicated Senior AI Infrastructure Engineer to join our DGX Cloud group. This engineering role will design... ...availability using a combination of software and systems engineering practices.... ...GPU cloud services deliver maximum reliability and uptime. They carefully prepare and...
NVIDIA
Santa Clara, CA
6 days ago
Senior Technical Program Manager
$170k - $185k
...Program Manager to join our engineering organization in Santa... ...spanning hardware, software, and integrated... ...successful releases of cloud software and/or handoffs... ...lunches, etc.) • On-site Health & Wellness programs... ..., ease of use, and reliability. All qualified applicants...
Website
Full time
Temporary work
Summer holiday
Worldwide
Flexible hours
Picarro
Santa Clara, CA
4 days ago
Technical Program Manager, Google Cloud Platform Reliability
$227k - $320k
Technical Program Manager, Google Cloud Platform Reliability Google Sunnyvale, CA, USA Bachelor'... ...10 years of experience in program management or engineering leadership. Experience with site reliability engineering, developer operations, and...
Website
Full time
Local area
Google Inc.
Sunnyvale, CA
3 days ago
Senior Network Reliability Engineer - DGX Cloud
$136k - $224.25k
...NVIDIA is looking for a Senior Network Reliability Engineer to support and maintain our cloud and datacenter network infrastructures. This network serves the needs across the whole software stack for NVIDIA, from Graphics Drivers to Autonomous Vehicles and Artificial...
Remote work
Shift work
NVIDIA
Santa Clara, CA
3 days ago
Senior Technical Program Manager - DGX Cloud Infra Security
$200k - $322k
...Technical Program Manager passionate about Cloud Security, you will drive the DGX Cloud infrastructure security... ...into execution roadmaps and the software development lifecycle. It aligns product... ...in Security, Compliance, SRE, and Engineering to continually advance and...
NVIDIA
Santa Clara, CA
7 days ago
Senior Manager/Associate Director, Software, Platform Engineering
$195.9k - $293.9k
...Senior Manager/Associate Director, Software, Platform Engineering... ...Your mission is to enable reliable, scalable, secure, and... ...experience is limited to cloud-native services, SaaS platforms... ...Agencies: Our Careers Site is only for individuals seeking...
Website
Work from home
Worldwide
Monday to Friday
pacb.com
Menlo Park, CA
3 days ago
Principal Software Engineer, DGX Cloud Production Engineering
$272k - $431.25k
...NVIDIA DGX Cloud is scaling GPU infrastructure across internal, partner, and cloud environments... .... We are looking for Principal Software Engineers to help shape the technical direction... ...-based operations, automation, and reliability across large‑scale GPU clusters. This...
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Senior Site Reliability Engineer | Uptime, Cloud & GenAI
Zocdoc, located in Silicon Valley, CA, is seeking a Senior Site Reliability Engineer to monitor and maintain cloud-based systems ensuring uptime for millions of patients. You'll work with cutting-edge technology in a diverse and collaborative environment. This role requires...
Website
Dormont Manufacturing Co
Palo Alto, CA
1 day ago
Technical Engineering Manager, Sensor Platform (AI-Driven)
$160k - $250k
...the Role: This is a Technical Engineering Manager role (50% Management... ...behavior, paired with a cloud component that aggregates telemetry... ...You'll Need: 10+ years of software engineering experience with... ...high concurrency and high reliability production environments. Experience...
Website
Full time
Work experience placement
Work at office
Local area
2 days per week
3 days per week
Koitecc Solutions
Sunnyvale, CA
2 days ago
Senior Director, System Software Engineering - DGX Cloud
$384k
...NVIDIA is seeking a Senior Director, System Software Engineering, to lead strategy and execution for capacity management in DGX Cloud, building the capacity foundation for NVIDIA's internal... ...developer platform leaders to deliver reliable, high-performance software that powers...
NVIDIA
Santa Clara, CA
4 days ago
SRE & DevOps Engineer: Platform Reliability & Automation
...is seeking a senior platform engineer to manage production infrastructure across multiple clouds. You will deploy and maintain... ...CD pipelines, ensuring system reliability, and collaborating with various... ...security mindset. This is an on-site position in Sunnyvale,...
Website
Work at office
2 days per week
Koitecc Solutions
Sunnyvale, CA
2 days ago
Staff SRE: Cloud Reliability Architect & Leader
Rubrik, Inc. seeks a Staff Site Reliability Engineer in Palo Alto, California to lead reliability and performance of enterprise... .... The ideal candidate will have 8-12+ years in software engineering, with a strong background in cloud systems and operational excellence. Key...
Website
Rubrik, Inc.
Palo Alto, CA
4 days ago
Site Reliability Engineer II Cloud Security & Networking
...A leading cybersecurity firm is seeking a Senior Backend Software Engineer to focus on the Azure Firewall Management Program. This position... ...requires coding experience in Go / Golang and familiarity with cloud environments like AWS or Azure. You will work on integrating...
Website
Work at office
Illumio
Sunnyvale, CA
2 days ago
Senior Technical Program Manager - DGX Cloud Storage
$200k - $322k
...NVIDIA’s DGX Cloud is redefining how organizations deploy and scale AI infrastructure.... ...is a high-impact role interfacing with engineering, product, operations, finance, and our... ...experience in program management of large-scale software or infrastructure projects ~ MS EE or...
NVIDIA
Santa Clara, CA
4 days ago
Senior Software Engineer, Attestation Services - DGX Cloud
$224k - $356.5k
...on the world. As part of the DGX Cloud organization, the Attestation... ..., silicon, and cloud engineering teams to turn embedded hardware... ...attestation standards into reliable, self-service cloud capabilities... ...security, silicon, platform, and software teams to deliver end-to-end...
Remote work
NVIDIA
Santa Clara, CA
3 days ago
Senior Software Engineer, DGX Cloud AI Infrastructure
$184k - $287.5k
Senior Software Engineer, DGX Cloud AI Infrastructure. Locations: US, CA, Santa Clara; US, TX, Austin; US, OR, Remote; US, WA, Remote; US,... ...ensure state-of-the-art LLM workloads run efficiently and reliably at scale. You will lead deep performance and reliability investigations...
Remote work
NVIDIA
Santa Clara, CA
2 days ago
Senior Site Reliability Engineer: Cloud, Kubernetes & CI/CD
A leading tech recruiting firm is seeking a Site Reliability Engineer to manage and optimize cloud infrastructure primarily using GCP or AWS. The role involves maintaining high availability through Kubernetes clusters and improving CI/CD pipelines with Terraform. Ideal...
Website
Amiri Recruiting
Mountain View, CA
3 days ago
Lead Site Reliability Engineer (GCP & Hybrid Cloud) Hybrid
$165k - $241.4k
...the intersection of applied AI, cloud infrastructure and security - partnering across engineering, security, compliance, and... ...the standard for automation and reliability that enables our AI models to scale... .... Please see the Cisco careers site to discover more benefits and...
Website
Full time
Temporary work
Local area
Flexible hours
Cisco
San Jose, CA
3 days ago
Principal Software Engineer - DGX Cloud
$272k - $431.25k
...—and amazing people. We are looking for a Principal Software Engineer to join our DGX Cloud team and build the foundational systems that drive NVIDIA... ...to ensure cohesive integration, clear interfaces, and reliable end-to-end workflows, with a strong focus on delivery....
NVIDIA
Santa Clara, CA
6 days ago
Technical Program Manager III, ML Infrastructure Resource Management, Google Cloud
$163k - $237k
...Infrastructure Resource Management, Google Cloud Note: By applying to this... ...in product development with engineers. The Machine Learning Resource... ...experience. Collaborate closely with Software Engineering (SWE) and Site Reliability Engineering (SRE) teams to uncover...
Website
Full time
Google Inc.
Sunnyvale, CA
5 days ago
Technical Program Manager II, Hardware Quality and Reliability, Data Centers
$138k - $198k
Technical Program Manager II, Hardware Quality and Reliability, Data Centers Google Sunnyvale, CA, USA Qualifications Bachelor's... ...qualifications Master's degree in electrical, mechanical, or software engineering. 2 years of experience managing cross-functional or cross-...
Full time
Work at office
Google Inc.
Sunnyvale, CA
5 days ago
Sr Site Reliability Engineer (Internet Security Platform)
$120.3k - $194.53k
...that drives great outcomes. Job Summary Palo Alto Networks runs a large hybrid infrastructure across multiple public clouds. As a Site Reliability Engineer on the Internet Security Platform team, you will be part of a team supporting Advanced DNS Security services. This...
Website
Full time
Work at office
Visa sponsorship
Work visa
Palo Alto Networks
Santa Clara, CA
2 days ago
