Director, Site Reliability and Software Engineering - DGX Cloud
NVIDIA Corporation, $320k
Locations: US, CA, Santa Clara; US, Remote
Time type: Full time
Posted on: Today
Job requisition ID: JR2017420

NVIDIA's invention of the GPU ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to grow our company, and grow our teams with the smartest people in the world. We are looking for you.

NVIDIA's GPUs are a hit in the market for deep learning, which is used in the research community and in industry to help solve many big data problems such as computer vision, speech recognition and translation, life science, image recognition, and natural language processing. NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges. In this environment, the NVIDIA GPU Cloud computing team is looking for leaders to work on a world-class deep learning platform.

**What you'll be doing:**

As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy. You will lead all aspects of cluster automation and operational excellence planning, and grow your team. You thrive in a fast-paced, iterative engineering environment and have experience delivering scalable distributed systems. Most importantly, you have a track record of earning the respect of past teams and cross-functional partners as both a technical leader and a manager, and you can work through influence rather than direct authority when needed.

The NVIDIA GPU Cloud Computing team works with customers across the entire company, and the ability to work across multiple levels of technical and organizational leadership is critical. Operating with scale and speed, our world-class software engineers are just getting started, and as a leader, you will guide the way in solving reliability for both our internally critical and our externally visible systems.

* Manage a team of Software and Site Reliability engineers, including program development, task planning, and code reviews.
* Define team strategy and roadmap, and drive adoption of scalable SDLC practices, test infrastructure, and modern practices in NVIDIA's DGX Cloud Computing environment.
* Drive technical projects and provide leadership in an innovative and fast-paced environment.
* Be responsible for the overall planning, tracking, and success of technical projects.
* Work closely with project and product management teams to ensure best-in-class product development.
* Contribute technically to projects for DGX Cloud Computing Services.
* Interact with key internal stakeholders to provide operational and financial clarity on technical spend.
* Drive decision making, visibility, and operational rigor across business analytics initiatives such as budget, project, and portfolio reporting. Lead efforts related to executive reporting, dashboards, and operational CTO metrics, focusing on continuous improvement and evolution to maximize decision making and executive visibility.

**What we need to see:**

* 12+ years of overall experience in engineering management. 5+ years of leadership.
* Bachelor's or Master's degree in Computer Science, or equivalent experience.
* Experience designing and implementing large-scale distributed systems.
* Experience with containers, virtualization environments, and cluster solutions.
* Experience managing technical support and DevOps teams.
* Ability to set an appropriate bar for technical excellence and deliver projects on tight deadlines.
* Strong knowledge of Unix/Linux.
* Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers.
* Demonstrated people management and leadership skills, with a proven track record of mentoring and coaching team members.
* Ability to quickly learn and evaluate new technologies.
* Ability to influence and establish relationships with other software and IT functional groups such as development, server, storage, and security teams.

We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our exclusive engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you!

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 320,000 USD - 488,750 USD for Level 5, and 384,000 USD - 575,000 USD for Level 6.

You will also be eligible for equity.

Applications for this job will be accepted at least until May 8, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

