Senior Platform and EngOps Engineer - Cluster Operations
$176k - $276k
NVIDIA
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Join our team of innovative engineers who develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High Performance Computing and Deep Learning. We're looking for highly motivated EngOps and Platform Engineers to boost execution efficiency while managing and maintaining large GPU clusters interconnected via NVLink and InfiniBand.

What you will be doing:

- Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
- Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
- Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.
- Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
- Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements.

What we need to see:

- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
- 8+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure.
- Automation expert with hands-on skills in Ansible, Python and Shell Scripting.
- Deep understanding of operating systems, computer networks, and high-performance applications.
- Proven ability to work effectively with developers and test engineers across different teams and time zones.
- Proficient with Linux fundamentals.

Ways to stand out from the crowd:

- Familiarity with resource scheduling managers, preferably Slurm.
- Direct experience with industry-standard alerting tools and emergency response practices.
- Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters.
- Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure.
- Proficiency in designing large-scale networking technologies and handling the associated challenges.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 176,000 USD - 276,000 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 18, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA Corporation


