
Director, Site Reliability and Software Engineering - DGX Cloud

$320k

NVIDIA Corporation

* Locations: US, CA, Santa Clara; US, Remote
* Time type: Full time
* Posted on: Posted Today
* Job requisition ID: JR2017420

NVIDIA's invention of the GPU ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to grow our company, and grow our teams with the smartest people in the world. We are looking for you.

NVIDIA's GPUs have become a market success for deep learning, which is used in the research community and in industry to help solve many big data problems such as computer vision, speech recognition and translation, life science, image recognition, and natural language processing. NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges. In this environment, the NVIDIA GPU Cloud computing team is looking for leaders to work on a world-class deep learning platform.

**What you'll be doing:**

As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy. You will lead all aspects of cluster automation and operational excellence planning, and you will grow your team. You thrive in a fast-paced, iterative engineering environment and have experience delivering scalable distributed systems. Most importantly, you have a track record of earning the respect of past teams and cross-functional partners as both a technical leader and a manager, and you are able to work through influence rather than direct authority when needed.

The NVIDIA GPU Cloud Computing team works with customers across the entire company, so the ability to work across multiple levels of technical and organizational leadership is critical. Operating at scale and speed, our world-class software engineers are just getting started, and as a leader you will guide the way in solving reliability for both our internally critical and our externally visible systems.

* Manage a team of Software and Site Reliability engineers, including program development, task planning, and code reviews.
* Define team strategy and roadmap, and drive adoption of scalable SDLC practices, test infrastructure, and modern practices in NVIDIA's DGX Cloud Computing environment.
* Drive technical projects and provide leadership in an innovative and fast-paced environment.
* Be responsible for the overall planning, tracking, and success of technical projects.
* Work closely with project and product management teams to ensure best-in-class product development.
* Contribute technically to projects for DGX Cloud Computing Services.
* Interact with key internal stakeholders to provide operational and financial clarity on technical spend.
* Drive decision making, visibility, and operational rigor across business analytics initiatives such as budget and project and portfolio reporting. Lead efforts related to executive reporting, dashboards, and operational CTO metrics, focusing on continuous improvement and evolution to maximize decision making and executive visibility.

**What we need to see:**

* 12+ years of overall experience in engineering management. 5+ years of leadership.
* Bachelor's or Master's degree in Computer Science, or equivalent experience.
* Experience designing and implementing large-scale distributed systems.
* Experience with containers, virtualization environments, and cluster solutions.
* Experience managing technical support and DevOps teams.
* Ability to set an appropriate bar for technical excellence and deliver projects on tight deadlines.
* Strong knowledge of Unix/Linux.
* Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers.
* Demonstrated people management and leadership skills, with a proven track record of mentoring and coaching team members.
* Ability to quickly learn and evaluate new technologies.
* Ability to influence and establish relationships with other software and IT functional groups such as development, server, storage, and security teams.

We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our exclusive engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you!

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 320,000 USD - 488,750 USD for Level 5, and 384,000 USD - 575,000 USD for Level 6.

You will also be eligible for equity.

Applications for this job will be accepted at least until May 8, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Vacancy posted 15 hours ago
