Senior Platform and EngOps Engineer - Cluster Operations

$176k - $276k

NVIDIA Corporation

Locations: US, CA, Santa Clara
Time type: Full time
Posted on: Posted Yesterday
Job requisition ID: JR2014633

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Join our team of innovative engineers who develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High Performance Computing and Deep Learning. We're looking for highly motivated EngOps and Platform Engineers to boost execution efficiency while managing and maintaining large GPU clusters interconnected via NVLink and InfiniBand.

**What you will be doing:**

* Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
* Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
* Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.
* Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
* Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements.

**What we need to see:**

* BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
* 8+ years of hands-on experience in deploying and administrating clusters, servers, switches, and related infrastructure.
* Automation expert with hands on skills in Ansible, Python and Shell Scripting.
* Deep understanding of operating systems, computer networks, and high-performance applications.
* Proven ability to work effectively with developers and test engineers across different teams and time zones.
* Proficient with Linux fundamentals.

**Ways to stand out from the crowd:**

* Familiarity with resource scheduling managers, preferably Slurm.
* Direct experience with industry standard alerting tools and emergency response practices.
* Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters.
* Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure.
* Proficiency in designing large scale networking technologies and the associated challenges.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 176,000 USD - 276,000 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 17, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 2 days ago
Similar jobs that could be interesting for you
Based on the Senior Platform and EngOps Engineer - Cluster Operations in Santa Clara, CA vacancy
  • NVIDIA Corporation in Santa Clara, CA is seeking a Senior Platform and EngOps Engineer for Cluster Operations. You will develop automated tools to manage GPU clusters and implement modern DevOps practices to ensure operational efficiency. The ideal candidate has a strong... 
    Platform
    Operations
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  •  ...You will be building an AI Data Center AIOps platform that turns raw, high‑volume telemetry into...  ...for GPU fleets. Join our team of innovative engineers who are building this platform and operating it (not the compute cluster): uptime, performance, data integrity, and safe... 
    Platform
    Operations
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $152k - $241.5k

     ...Overview We’re looking for a Senior SRE to join our...  ...of our global services platform. At NVIDIA, you’ll...  ...and implementation to operation and continuous improvement...  ...large‑scale HPC clusters using Slurm, LSF or Kubernetes...  ...Ruby. Mentored other engineers and influenced... 
    Platform
    Operations
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    5 days ago
  • $165k - $242k

     ...the Team The Business Systems Engineering team partners closely with Data Center Operations, Infrastructure, Facilities, and...  ...footprint. This team owns the platforms and integrations that enable asset...  ..., operate and scale Kubernetes clusters supporting operational... 
    Platform
    Operations
    Senior
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    4 days ago
  •  ...Senior Technical Program Manager Hardware Infrastructure...  ...bringup, decom and operations and modernization of...  ...a TPM to guide engineering roadmaps to be delivered...  ...and processes to scale cluster build operations and management...  ...to provide a Platform as a Product offering... 
    Platform
    Operations
    Senior

    NVIDIA

    Santa Clara, CA
    19 hours ago
  • $168k - $270.25k

    At NVIDIA, we operate at the core of enterprise security, architecting...  ...the most advanced computing platforms in the world. This role...  ...working alongside outstanding engineers and security leaders. NVIDIA'...  ...Security organization is seeking a Senior Cybersecurity Engineer -... 
    Platform
    Operations
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $174k - $253k

    Senior Engineer, GDC, Lifecycle Management, Supply Chain Platform Google - Sunnyvale, CA, USA Bachelor's degree or equivalent practical experience. 5 years of experience...  ...New Product Introduction (NPI), Infrastructure, Operations, and others. US: $174,000 - $253,000 (USD) + 15... 
    Platform
    Operations
    Senior

    Google Inc.

    Sunnyvale, CA
    5 days ago
  •  ...currently seeking a passionate and driven Sales Engineer to join our exceptional team. As a Sales...  ...environments, ensuring flawless operation and meeting their specific needs. Develop...  ...Enterprise Hardware or Software, Cloud Platforms, IaaS, PaaS, or Virtual Infrastructure... 
    Platform
    Operations
    Senior
    Remote job
    Flexible hours

    Cohesity Inc.

    Santa Clara, CA
    1 day ago
  •  ...well as a strong company culture. Sales Engineer - Cohesity Job Responsibilities Uncover...  ...environments, ensuring flawless operation and meeting their specific needs. Develop...  ...Enterprise Hardware or Software, Cloud Platforms, IaaS, PaaS, or Virtual Infrastructure software... 
    Platform
    Operations
    Senior
    Remote job
    Work at office
    Worldwide
    Flexible hours
    2 days per week
    3 days per week

    Cohesity Inc.

    Santa Clara, CA
    4 days ago
  • $132k - $207k

    We are looking to hire a System Test Engineer who will work in the test solutions group at...  ...on-site support for builds and factory operations as necessary. What we need to see 5+ years...  ...manufacturing test programs at the platform level. Demonstrated experience with data... 
    Platform
    Operations
    Senior
    Local area
    Overseas

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  •  ...Europe Role Overview Seeking a Senior Site Reliability Engineer / DevOps Engineer to design, scale, and operate highly available global...  ...operating Kubernetes and cloud platforms at scale. The ideal...  ...troubleshoot production Kubernetes clusters Handle cluster lifecycle... 
    Platform
    Operations
    Senior

    Prophet Town

    Mountain View, CA
    4 days ago
  • $181.1k - $318.4k

     ...thousands of accelerators creates challenges that few engineers ever encounter. In Apple’s Machine Learning Platform Technologies organization, we build the...  ...analysis and implement processes to ensure optimal operation and growth of our infrastructure. This includes working... 
    Platform
    Operations
    Senior
    Relocation

    Apple Inc.

    Santa Clara, CA
    5 days ago
  •  ...Department of Defense, is looking for a Senior Systems Engineer to work in our Communications...  ...performance computing and often simultaneous operation of these capabilities. Who are we...  ...processing on 3U OpenVPX CMOSS/SOSA aligned platforms. Pacific Defense is developing a... 
    Platform
    Operations
    Senior
    Immediate start
    Flexible hours

    Pacific DEFENSE Inc.

    Sunnyvale, CA
    1 day ago
  • $175k - $275k

     ...the Role We are looking for a hands-on Senior Quality Engineer to drive Manufacturing Quality across...  ...Engineering, Manufacturing Operations, and our CMs to establish a “quality...  ...Join Cerebras Build a breakthrough AI platform beyond the constraints of the GPU. Publish... 
    Platform
    Operations
    Senior
    Contract work

    Cerebras

    Sunnyvale, CA
    2 days ago
  •  ...healthcare, the home, and beyond. We operate at the cutting edge of embodied AI, applying...  ...the world for the better. As a Senior Software Engineer on the Autonomy team at Apptronik, you...  ...Controls, Reinforcement Learning, and Platform teams, and help shape Apptronik’s long... 
    Platform
    Operations
    Senior
    Local area

    Booster

    Sunnyvale, CA
    1 day ago
  • $144k - $209k

    Senior Manufacturing Engineer, Global Manufacturing Operations X Note: By applying to this position you will have an opportunity to share your preferred working...  ...10 (System Assembly), and L11 (Rack Integration) platforms. Moving beyond traditional sustaining engineering... 
    Platform
    Operations
    Senior
    Contract work
    Worldwide

    Google Inc.

    Sunnyvale, CA
    2 days ago
  • $147k - $216k

    About the job As a Senior Hardware Engineer, you will work on ML/AI hardware systems projects to craft...  ..., and providing the essential platforms that enable developers to build the future...  ...Global Networking, Data Center operations, systems research, and much more. Responsibilities... 
    Platform
    Operations
    Senior
    Full time
    Worldwide

    Google

    Sunnyvale, CA
    2 days ago
  • $224k - $356.5k

    At NVIDIA, our Financial Systems Engineering team is at the heart of ensuring that our massive scale operates with zero friction. We are responsible for architecting and...  ...data pipelines using high-volume streaming platforms (Kafka) and processing frameworks (Spark or Flink... 
    Platform
    Operations
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • Senior Manufacturing Design Engineer, Methods and Standards Archer is an aerospace company based in San Jose...  ...the next‑generation Midnight 1.1 platform. As a member of the Manufacturing Design...  ..., Quality, Facilities, and Operations to ensure standards are practical,... 
    Platform
    Operations
    Senior
    Local area
    Night shift

    Archer

    Santa Clara, CA
    3 days ago
  • $100k - $125k

     ...Salary: $100,000 - $125,000 per year Senior Project Engineer Reports to: Project Manager...  ...software, and construction management platforms (CMiC or similar) ~ Ability to manage...  ...manager headquartered in San Jose with operations throughout the greater Bay Area and... 
    Platform
    Operations
    Senior
    Full time
    For contractors
    For subcontractor
    Work at office

    Blach Construction

    San Jose, CA
    3 days ago
  • $200k - $322k

    Senior Manager, Site Reliability Engineering...  ...to lead and reshape how IT operations function at scale. This role goes...  ...development of automation and orchestration platforms that reduce manual effort across... 
    Platform
    Operations
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • Job Overview The Senior Data Center Operations Engineer plays a critical, hands‑on role in supporting the build‑out and long‑term operation of a high...  ...Diagnose and resolve complex hardware failures across server platforms (motherboards, CPUs, memory, storage) Perform... 
    Platform
    Operations
    Senior
    Local area

    Milestone Technologies, Inc.

    Sunnyvale, CA
    5 days ago
  • $160k - $220k

     ...Description Matternet designs, builds, and operates autonomous drone networks for fast,...  ...-emission delivery. We’re seeking a Senior Mechanical Engineer to lead the design, prototyping,...  ...latching/locking mechanisms, landing-platform interfaces and FOD/propeller-safety features... 
    Platform
    Operations
    Senior
    Flexible hours

    Matternet

    Mountain View, CA
    14 days ago
  • $272k - $431.25k

    Overview NVIDIA is seeking a Senior MLOps Engineering Manager to join our Autonomous Driving organization...  ...to lead the build, development, and operation of large‑scale, end‑to‑end data and...  ...infrastructure, CI/CD, and data platforms. What We Need to See Bachelor’s or equivalent... 
    Platform
    Operations
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    5 days ago
  • $108k - $145k

     ...a **Virtual Design & Construction, Senior Project VDC Engineer**. A successful candidate will lead...  ...(if applicable) or other equitable platforms* Support setup of VDC field equipment...  ...where applicable (laser scanning, drone operations, RTLS, etc.)* Assist VDC Manager in... 
    Platform
    Operations
    Senior
    Contract work
    For contractors
    Work at office
    Flexible hours

    DPR Construction

    Santa Clara, CA
    1 day ago
  •  ...aircraft architecture and scalable platform have been flying for over 10 years....  ...seeking a highly skilled and motivated Senior Electro-Mechanical Engineer to join our team. In this position...  ...by embedding safety into daily operations, identifying and mitigating risks... 
    Platform
    Operations
    Senior
    Work at office

    Pivotal

    Palo Alto, CA
    8 days ago
  • $129.4k - $198.4k

    Role As a Senior Autonomy Behavior Validation Engineer on the Software Validation team within the AV organization...  ...of the development, validation, or operations lifecycle. Experience working...  ...Looker, Jupyter notebooks, or similar platforms to communicate validation results... 
    Platform
    Operations
    Senior
    Flexible hours

    General Motors

    Sunnyvale, CA
    4 days ago
  • $126k - $204.5k

     ...delivers the industry’s most advanced SecOps platform, consisting of XDR, XSIAM, XSOAR, and...  ...Cortex DevOps team, your role involves operating and maintaining a large‑scale GCP...  ...you will collaborate closely with our engineering teams to develop innovative solutions that... 
    Platform
    Operations
    Senior

    Palo Alto Networks, Inc.

    Santa Clara, CA
    5 days ago
  •  ...You are a highly experienced engineering professional with a passion...  ...timely delivery and robust test platforms. Maintaining clear...  ...R&D, Test Development, and Operations, leading to innovative solutions...  .... 8+ years of experience in senior manufacturing test engineering... 
    Platform
    Operations
    Senior
    Contract work
    Local area
    Shift work

    Synopsys, Inc.

    Sunnyvale, CA
    5 days ago
  • $175k - $263k

     ...innovative, high‑availability storage platforms. You will lead the development of mission...  ...control‑plane protocols—that simplify operations for global enterprise customers. Lead...  ...ASIC capabilities. Availability‑Focused Engineering: Understanding of non‑disruptive... 
    Platform
    Operations
    Senior
    Work at office
    Flexible hours

    Pure Storage

    Santa Clara, CA
    4 days ago
