Senior Platform and EngOps Engineer - Cluster Operations

$176k - $276k
Full-time

NVIDIA

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Join our team of innovative engineers who develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High Performance Computing and Deep Learning. We're looking for highly motivated EngOps and Platform Engineers to boost execution efficiency while managing and maintaining large GPU clusters interconnected via NVLink and InfiniBand.

What you will be doing:
  • Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
  • Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
  • Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.
  • Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
  • Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements.

What we need to see:
  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
  • 8+ years of hands-on experience in deploying and administrating clusters, servers, switches, and related infrastructure.
  • Automation expert with hands-on skills in Ansible, Python and Shell Scripting.
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Proven ability to work effectively with developers and test engineers across different teams and time zones.
  • Proficient with Linux fundamentals.

Ways to stand out from the crowd:
  • Familiarity with resource scheduling managers, preferably Slurm.
  • Direct experience with industry standard alerting tools and emergency response practices.
  • Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters.
  • Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure.
  • Proficiency in designing large scale networking technologies and the associated challenges.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 176,000 USD - 276,000 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 2, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.

Vacancy posted 2 days ago

Similar jobs that could be interesting for you
Based on the Senior Platform and EngOps Engineer - Cluster Operations in Santa Clara, CA vacancy
  • $165k - $242k

     ...the Team The Business Systems Engineering team partners closely with Data Center Operations, Infrastructure, Facilities, and...  ...footprint. This team owns the platforms and integrations that enable asset...  ..., operate and scale Kubernetes clusters supporting operational... 
    Platform
    Operations
    Senior
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    5 days ago
  •  ...funded Silicon Valley startup that has built the leading operational intelligence platform for digital infrastructure. By adopting an AI/ML-based...  ...driven conversational user experience, and automated data engineering pipelines. Our solutions are used by leading Banking,... 
    Platform
    Operations
    Senior

    Selector Software

    Santa Clara, CA
    3 days ago
  •  ...Senior Technical Program Manager Hardware Infrastructure...  ...bringup, decom and operations and modernization of...  ...a TPM to guide engineering roadmaps to be delivered...  ...and processes to scale cluster build operations and management...  ...to provide a Platform as a Product offering... 
    Platform
    Operations
    Senior

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $174k - $253k

    Senior Engineer, GDC, Lifecycle Management, Supply Chain Platform Google - Sunnyvale, CA, USA Bachelor's degree or equivalent practical experience. 5 years of experience...  ...New Product Introduction (NPI), Infrastructure, Operations, and others. US: $174,000 - $253,000 (USD) + 15... 
    Platform
    Operations
    Senior

    Google Inc.

    Sunnyvale, CA
    6 days ago
  • $139k - $242k

     ...Senior Security Production Engineer Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA / San...  ...by pioneers, CoreWeave delivers a platform of technology, tools, and teams that...  ...footprint, enabling safe and efficient operations for enterprise and AI workloads at... 
    Platform
    Operations
    Senior
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    5 days ago
  •  ...Department of Defense, is looking for a Senior Systems Engineer to work in our Communications...  ...performance computing and often simultaneous operation of these capabilities. Who are we...  ...processing on 3U OpenVPX CMOSS/SOSA aligned platforms. Pacific Defense is developing a... 
    Platform
    Operations
    Senior
    Immediate start
    Flexible hours

    Pacific DEFENSE Inc.

    Sunnyvale, CA
    2 days ago
  •  ...healthcare, the home, and beyond. We operate at the cutting edge of embodied AI, applying...  ...the world for the better. As a Senior Software Engineer on the Autonomy team at Apptronik, you...  ...Controls, Reinforcement Learning, and Platform teams, and help shape Apptronik’s long... 
    Platform
    Operations
    Senior
    Local area

    Booster

    Sunnyvale, CA
    2 days ago
  • $126k - $204.5k

     ...delivers the industry's most advanced SecOps platform, consisting of XDR, XSIAM, XSOAR, and...  ...Cortex DevOps team, your role involves operating and maintaining a large-scale GCP...  ...you will collaborate closely with our engineering teams to develop innovative solutions that... 
    Platform
    Operations
    Senior
    Full time
    Work at office

    Palo Alto Networks

    Santa Clara, CA
    4 days ago
  • Senior Manufacturing Design Engineer, Methods and Standards Archer is an aerospace company based in San Jose...  ...the next‑generation Midnight 1.1 platform. As a member of the Manufacturing Design...  ..., Quality, Facilities, and Operations to ensure standards are practical,... 
    Platform
    Operations
    Senior
    Local area
    Night shift

    Archer

    Santa Clara, CA
    4 days ago
  • $100k - $125k

     ...Salary: $100,000 - $125,000 per year Senior Project Engineer Reports to:Project Manager...  ...software, and construction management platforms (CMiC or similar) ~ Ability to manage...  ...manager headquartered in San Jose with operations throughout the greater Bay Area and... 
    Platform
    Operations
    Senior
    Full time
    For contractors
    For subcontractor
    Work at office

    Blach Construction

    San Jose, CA
    4 days ago
  • $160k - $220k

     ...Description Matternet designs, builds, and operates autonomous drone networks for fast,...  ...-emission delivery. We’re seeking a Senior Mechanical Engineer to lead the design, prototyping,...  ...latching/locking mechanisms, landing-platform interfaces and FOD/propeller-safety features... 
    Platform
    Operations
    Senior
    Flexible hours

    Matternet

    Mountain View, CA
    15 days ago
  •  ...aircraft architecture and scalable platform have been flying for over 10 years....  ...seeking a highly skilled and motivated Senior Electro-Mechanical Engineer to join our team. In this position...  ...by embedding safety into daily operations, identifying and mitigating risks... 
    Platform
    Operations
    Senior
    Work at office

    Pivotal

    Palo Alto, CA
    9 days ago
  •  ...You are a highly experienced engineering professional with a passion...  ...timely delivery and robust test platforms. Maintaining clear...  ...R&D, Test Development, and Operations, leading to innovative solutions...  .... 8+ years of experience in senior manufacturing test engineering... 
    Platform
    Operations
    Senior
    Contract work
    Local area
    Shift work

    Synopsys, Inc.

    Sunnyvale, CA
    6 days ago
  • $174k - $252k

    Senior Software Engineer, Embedded Systems/Firmware, AI and Infrastructure Sunnyvale, CA, USA Bachelor...  ...of experience working with embedded operating systems. Preferred qualifications:...  ...the next generation of Google platforms, we make Google's product portfolio possible... 
    Platform
    Operations
    Senior
    Full time
    Worldwide

    Google Inc.

    Sunnyvale, CA
    3 days ago
  • $176k - $276k

    Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build...  ...such as limiting time spent on reactive operational work, blameless postmortems and...  ...scale Observability & Telemetry collection platform with a focus on performance at scale, real... 
    Platform
    Operations
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • $139k - $204k

     ...pioneers, CoreWeave delivers a platform of technology, tools, and...  ...: We're looking for a Senior Storage Engineer, Control Plane to play a...  ...in designing, building, and operating the control plane for our high...  ...dedicated storage clusters into diverse customer environments... 
    Platform
    Operations
    Senior
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    21 days ago
  • $113.67k - $153.8k

     ...Integrity Associates (SIA) is seeking a Senior Mechanical Engineer with expertise in thermo-fluid...  ...with major OEM steam and gas turbine platforms. The successful candidate will apply...  ...STAR-CCM+) to evaluate flow behavior, operating conditions, performance characteristics... 
    Platform
    Operations
    Senior
    Temporary work
    Casual work
    Flexible hours

    SI Solutions, LLC

    San Jose, CA
    8 days ago
  • $159k - $231k

    Senior Hardware Power Test Engineer, Platforms Infrastructure Google, Sunnyvale, CA, USA Overview As a Senior Hardware Power Test Engineer, you will be...  ...Knowledge of power measurement equipment and tools and the operation of DC‑DC converters. Experience with bench‑level... 
    Platform
    Operations
    Senior
    Full time

    Google Inc.

    Sunnyvale, CA
    5 days ago
  •  ...Senior Systems Engineer US - Milpitas About Us Graphcore is one of the world's leading innovators...  ...Hardware Engineer to provide advanced operational, diagnostic, and engineering support for Graphcore's Arm-based hardware platforms across lab and data center... 
    Platform
    Operations
    Senior
    Flexible hours

    Graphcore

    Milpitas, CA
    1 day ago
  • $142k - $165k

     ...tilt-aircraft architecture and scalable platform have been flying for over 10 years....  ...a highly experienced and motivated Senior Mechanical Engineer to join our team. In this position you...  ...mindset by embedding safety into daily operations, identifying and mitigating risks... 
    Platform
    Operations
    Senior
    Work at office

    Medium

    Palo Alto, CA
    2 days ago
  • $142.7k - $270.95k

     ...Photoshop ART is seeking a Senior researcher - Machine...  ...Systems & Efficiency Engineer to join our R&D team...  ...including techniques such as operator fusion and graph-level...  .... Containerization & Cluster Operations:...  ...create through innovative platforms and tools that unleash... 
    Platform
    Operations
    Senior
    Full time
    Temporary work
    Local area
    Worldwide

    Adobe

    San Jose, CA
    1 day ago
  • $188k - $275k

     ...pioneers, CoreWeave delivers a platform of technology, tools, and...  ...What You'll Do: The Field Engineering organization at CoreWeave is...  ...alongside the teams that build and operate each layer, you are the deep...  ...lifecycle: leading new GPU cluster bring-up and acceptance,... 
    Platform
    Operations
    Senior
    Permanent employment
    Contract work
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    20 days ago
  •  ...of patients worldwide. We’re a team of engineers, clinicians, and innovators united by one...  ...to architecture of manufacturing platform and applications. The position will require...  ...enhance productivity Understand products’ operations and controls, and develop the means to ensure... 
    Platform
    Operations
    Senior
    Full time
    Local area
    Worldwide
    Flexible hours
    Shift work

    Intuitive

    Sunnyvale, CA
    1 day ago
  • $150k - $200k

     ...Senior Electrical Engineer $150000 - $200000 per year | Menlo Park, CA | On-Site | Permanent Cutting-Edge Advanced Space Systems...  ...designed to support next-generation satellite operations and radar-based systems. Our platform enables enhanced visibility and mapping of... 
    Platform
    Operations
    Senior
    Permanent employment
    Local area

    Australia-Employment

    Menlo Park, CA
    1 day ago
  • $179k - $218k

     ...the ground up, we own and operate each layer of the stack...  .... We are seeking a Senior Staff Data Center Operations Engineer, GPU Hardware Architecture...  ...authority on GPU platforms within the Data Center Engineering...  ...needed to maintain peak cluster health.   The... 
    Platform
    Operations
    Senior
    Temporary work

    Crusoe

    Sunnyvale, CA
    26 days ago
  • Senior Identity & Access Management Engineer - Moveworks Company Description Who we are Moveworks is the Agentic AI Assistant platform that empowers the entire workforce. Our platform enables employees...  ...tasks and streamline business operations. Recognized on the Forbes... 
    Platform
    Operations
    Senior
    Work at office
    Remote work
    Flexible hours

    Moveworks

    Mountain View, CA
    2 days ago
  • $132k - $207k

     ...NVIDIA is seeking a highly skilled QA Engineer to join our Workstation and Virtualization...  ...experience in optimizing virtualization platforms (VMware ESXi, Citrix Hypervisor, Microsoft...  ..., supercomputers, and computer clusters, including caches, buses, memory controllers... 
    Platform
    Senior
    Remote work
    Flexible hours

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $180k - $260k

     ...integration into customers' logistics operations. About the role We are seeking an experienced Senior/Staff Site Reliability Engineer to support the operation, monitoring, and...  ...closely with our infrastructure and platform teams to manage rollouts of both on-premises... 
    Platform
    Operations
    Senior
    Odd job
    Work at office
    Remote work

    Gatik AI

    Mountain View, CA
    5 days ago
  • $176.4k - $319.72k

     ...Nuro gives the automakers and mobility platforms a clear path to AVs at commercial...  ...connected future. About the Role As a Senior/Staff Systems Engineer working on Autonomy Verification,...  ...Software, Simulation, Product, and Operations. You will have end‑to‑end ownership... 
    Platform
    Operations
    Senior
    Odd job
    Work experience placement

    Kindredventures

    Mountain View, CA
    5 days ago
  • $254.5k

     ...It all started when engineer Fred Luddy wrote code...  ...reinvention. Our ServiceNow AI platform brings together any AI...  ...are built and operated at scale. • You will...  ...scale of hundreds of clusters and dozens of product...  ...engineers, and other senior technical leaders to drive... 
    Platform
    Operations
    Work at office
    Immediate start
    Remote work
    Flexible hours

    ServiceNow

    Santa Clara, CA
    1 day ago