Senior Platform and EngOps Engineer - Cluster Operations
$176k - $276k
NVIDIA
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Join our team of innovative engineers who develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High-Performance Computing and Deep Learning. We're looking for highly motivated EngOps and Platform Engineers to boost execution efficiency while managing and maintaining large GPU clusters interconnected via NVLink and InfiniBand.

What you will be doing:
- Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
- Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
- Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.
- Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.
- Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements.

What we need to see:
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
- 8+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure.
- Automation expert with hands-on skills in Ansible, Python and Shell Scripting.
- Deep understanding of operating systems, computer networks, and high-performance applications.
- Proven ability to work effectively with developers and test engineers across different teams and time zones.
- Proficient with Linux fundamentals.

Ways to stand out from the crowd:
- Familiarity with resource scheduling managers, preferably Slurm.
- Direct experience with industry standard alerting tools and emergency response practices.
- Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters.
- Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure.
- Proficiency in designing large scale networking technologies and the associated challenges.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 176,000 USD - 276,000 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 2, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.


