Senior HPC Cluster Engineer

NVIDIA Gruppe

NVIDIA is searching for a highly skilled HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for Electronic Design Automation and high-performance computing workloads across multiple teams and projects. The role collaborates with researchers and infrastructure teams to ensure clusters are performant, scalable, and reliable.

What you'll be doing:

Develop and enhance our ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
Continuously improve infrastructure provisioning, management, observability, and day-to-day operation through automation.
Provide technical leadership and strategic guidance for managing large-scale HPC systems, including deployment of compute, networking, and storage.
Foster strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
Support researchers to run their EDA workloads, including performance analysis and optimizations.
Conduct root cause analysis and suggest corrective action; proactively find and fix issues before they occur.
Build innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.

What we need to see:

Bachelor’s degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS or K8s, and applied experience with AI/HPC workflows that use MPI and NCCL.
Proficiency in using Linux including Rocky/CentOS/RHEL and/or Ubuntu Linux distributions, with a solid understanding of container technologies such as Enroot and Docker.
Proficiency in Python and Bash.
Experience analyzing and tuning performance for a variety of EDA workloads, with excellent problem‑solving skills to identify bottlenecks and implement scalable solutions.
Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure fields.

Ways to stand out from the crowd:

Background with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking.
Experience supporting EDA workloads and tools.
Familiarity with high‑speed networking pertaining to HPC, including InfiniBand, RDMA and RoCE.
Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.

The base salary will be determined based on location, experience, and the pay of employees in similar positions. Base salary ranges are $152,000 – $241,500 for Level 3 and $184,000 – $287,500 for Level 4. You will also be eligible for equity and benefits.

Applications will be accepted until March 15, 2026.

NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering a diverse work environment and is an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


Vacancy posted 4 days ago

Similar jobs that could be interesting for you

Based on the Senior HPC Cluster Engineer in Santa Clara, CA vacancy

Senior HPC Cluster Engineer
$152k - $241.5k
...Come join the team and see how you can make a lasting impact on the world. We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA (Electronic Design Automation) and high-performance computing...
Senior
NVIDIA
Santa Clara, CA
3 days ago
Senior HPC Storage Engineer
$184k - $287.5k
...next-gen distributed storage services for HPC workloads, optimizing both performance... ...our researchers to run their flows on our clusters including performance analysis and... ...degree in Computer Science, Electrical Engineering or related field or equivalent experience...
Senior
NVIDIA
Santa Clara, CA
2 days ago
Senior HPC Engineer
$140k - $160k
...ASRC Federal is looking for a Senior HPC Engineer, as ASRC Federal InuTeq provides High Performance Computing services across the full HPC... ...Key Responsibilities: Design, deploy and maintain HPC clusters with over 2000+ nodes with InfiniBand, 100+ petabytes of data...
Senior
Contract work
Weekend work
ASRC Federal Holding Company
Mountain View, CA
3 days ago
Senior AI/HPC GPU Cluster Architect (Equity)
NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5+ years...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Site Reliability Engineer - HPC
$152k - $241.5k
...Overview We’re looking for a Senior SRE to join our Compute Farm team... ...they integrate cleanly with HPC schedulers, storage, and network... ...supporting large‑scale HPC clusters using Slurm, LSF or Kubernetes... ...Perl, or Ruby. Mentored other engineers and influenced technical direction...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior GPU Cluster Architect for AI/HPC Deployments
$184k - $356.5k
NVIDIA Gruppe is seeking an experienced engineer to lead GPU cluster design and support for AI and HPC deployments in Santa Clara, California. The ideal candidate will have over 8 years of experience with large-scale GPU infrastructure and a strong ability to communicate...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Math Libraries Engineer - AI and HPC
$184k - $287.5k
...NVIDIA Math Libraries team is looking for a senior engineer to join our development efforts in the area of kernel generation for AI and HPC, specifically targeting matrix operations, JITing and fusions. Around the world, leading commercial and academic organizations are...
Senior
Remote work
NVIDIA
Santa Clara, CA
19 hours ago
Senior GPU Performance Engineer - MPI/NCCL & HPC
$152k - $287.5k
NVIDIA Corporation is seeking a motivated Performance Engineer to enhance the roadmap of communication libraries. In this role, you will conduct in-depth performance characterization on multi-GPU clusters and analyze the interaction of libraries with hardware and software...
Senior
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior GPU Math Library Engineer: AI & HPC Kernel Lead
NVIDIA Gruppe is looking for a senior engineer to join their Math Libraries team in Santa Clara, California. This role involves designing and... .... The ideal candidate has over 8 years of experience in HPC software development using C++, along with leadership skills and...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
HPC Operations Engineer, Compute Clusters & Automation
$124k - $195.5k
NVIDIA Gruppe in Santa Clara seeks an HPC Operations Engineer to design and implement compute clusters for silicon development. Ideal candidates will have experience troubleshooting in large-scale environments and enhancing deployment automation. Applicants should be proficient...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior HPC and LSF Operations Engineer
$152k - $241.5k
...environment remains resilient, measurable, and aligned with long-term engineering demands. What you'll be doing: Manage, scale, and... ...supporting and tuning job scheduling systems (LSF, Slurm, etc.) in HPC or silicon design environments Proficiency in Linux systems...
Senior
NVIDIA
Santa Clara, CA
3 days ago
Senior AI and ML HPC Cluster Engineer
As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the... ...of ground-breaking GPU compute clusters that run demanding deep learning, high-performance... ...degree in Computer Science, Electrical Engineering or related field or equivalent...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior HPC Infra Engineer — Cloud-Native, Scalable Systems
$152k - $241.5k
NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to enhance their HPC infrastructure. The role involves applying distributed systems patterns, automation, and building scalable services in a hybrid multi-cloud environment. Candidates should have strong...
Senior
NVIDIA Gruppe
Santa Clara, CA
3 days ago
Senior Hardware Systems Engineer - AI Rack & Cluster Infrastructure
$131k - $175k
...Senior Hardware Systems Engineer – AI Rack & Cluster Infrastructure Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus and routing environments. What sets us apart is our relentless pursuit of innovation....
Senior
Remote work
Flexible hours
Arista Networks, Inc.
Santa Clara, CA
19 hours ago
Senior ASIC DV Engineer — High-Speed Interconnects & HPC
A pioneering technology firm in Sunnyvale, CA is seeking an ASIC Design Verification Engineer to ensure the functional correctness of high-speed low-power digital integrated circuits. The ideal candidate will have significant experience in ASIC verification, particularly...
Senior
Avicena Inc.
Sunnyvale, CA
3 days ago
Senior HPC Network Engineer: RDMA, GPU Clusters
$200k - $400k
A dedicated research lab is seeking a Network Engineer to design and optimize low-latency, high-bandwidth networking solutions for AI supercomputing clusters. You will work on cutting-edge technologies in collaboration with world-class researchers. The ideal candidate has...
Senior
Institute of Foundation Models
Sunnyvale, CA
3 days ago
HPC Systems Engineer - Cluster Design & Deployment
KLA, located in Milpitas, California, is seeking an HPC Systems Architect to design and support HPC clusters vital for IC fabs and mask shops globally. The ideal candidate will have deep expertise in Linux systems and virtualization technologies, and will drive innovative...
KLA
Milpitas, CA
4 days ago
Senior Specialist Field Engineer - HPC/AI/ML
$165k - $220k
...aligns closely with the internal and customer engineering teams, offering valuable insights from... ...development. About the role: As a Senior Specialist Field Engineer CoreWeave, you... ...within high-performance compute (HPC) environments Collaborate closely with...
Senior
Permanent employment
Temporary work
Casual work
Work at office
Flexible hours
CoreWeave
Sunnyvale, CA
13 days ago
Senior Distributed Systems Engineer
$200k - $400k
...Institute Of Foundation Models Engineer The Institute of Foundation Models (IFM) designs... ...tolerant distributed execution under real-world cluster failures Core Technical Scope ·... ...links to relevant distributed systems, HPC, or large-scale training projects · Include...
Senior
Visa sponsorship
Institute of Foundation Models
Sunnyvale, CA
2 days ago
Senior Director, CTIO Engineering Technologists
$277k - $358k
...Job Description Senior Director, CTIO Engineering Technologists From applied research to advanced engineering, the Engineering... ...of their organization related to ~ HPC (high-performance compute) clusters, AI compute, AI Datacenter, AI Storage etc. ~...
Senior
Dell
Santa Clara, CA
3 days ago
Senior GPU Communications & HPC Software Engineer
$152k - $241.5k
NVIDIA Corporation is seeking a highly motivated Senior Software Engineer for its communication libraries and network software team in Santa Clara, California. This role involves designing and maintaining communication runtimes for Deep Learning frameworks and participating...
Senior
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior Software Engineer - Datacenter Systems
$184k - $287.5k
...datacenter provisioning and management. As a Senior Software Engineer - Datacenter Systems, you will work... ...technology supporting large-scale GPU clusters connected through NVLink and InfiniBand. These clusters run today's fastest HPC and AI workloads. This role suits...
Senior
Remote work
NVIDIA
Santa Clara, CA
3 days ago
Senior Deep Learning Framework Communications Engineer
...depth AI workload performance characterization on multi‑GPU clusters. Design fault‑tolerant and elastic solutions for large‑scale... ...or related field (or equivalent experience) with 5+ software engineering and HPC/AI experience Development or integration experience with...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior HPC Scheduler & Reliability Engineer — Equity Options
NVIDIA Gruppe in Santa Clara is hiring for a role in their Hardware Infrastructure EDA Compute team to optimize workload scheduling systems and improve overall service reliability. The successful candidate will manage and scale job scheduling systems while driving measurable...
Senior
NVIDIA Gruppe
Santa Clara, CA
4 days ago
HPC Engineer
$150k - $300k
...performance, and availability of large-scale GPU clusters. • Respond to incidents and perform... ...'s degree in Computer Science, Computer Engineering, Software Engineering, Information... ...administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations. •...
Night shift
Institute of Foundation Models
Sunnyvale, CA
2 days ago
HPC Performance Engineer
$165k - $242k
...HPC Performance Engineer CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology... ...bare-metal systems from POST through joining a Kubernetes cluster. The team's primary responsibilities include maintaining a...
Temporary work
Casual work
Work at office
Remote work
Flexible hours
CoreWeave
Sunnyvale, CA
3 days ago
Senior Software Engineer - HPC
$152k - $241.5k
...impact on the world. We are looking for a Senior Software Engineer to join our mission to continue improving our HPC infrastructure. Our team builds and operates sophisticated... ...core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems...
Senior
NVIDIA
Santa Clara, CA
19 hours ago
Senior Software SDET Test Development Engineer
$140k - $224.25k
...markets include gaming, automotive, vision, HPC, datacenters and networking in addition... ...with various telemetries, scale out cluster, test plan development, track record in developing... ...) in a STEM (Science, Technology, Engineering, Math or Physics) field ~5+ years proven...
Senior
NVIDIA
Santa Clara, CA
2 days ago
GPU HPC Benchmarking & Telemetry Engineer
NVIDIA Gruppe in Santa Clara, California is seeking a skilled HPC/AI Benchmarking and Telemetry Engineer to join their team. In this role, you will develop benchmarking approaches for large-scale HPC and AI clusters, create telemetry frameworks to capture performance data,...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Virtualization Validation Engineer
$172.5k - $210k
...the Role: As a Virtualization Validation Engineer, you will be responsible for the end-to-... ...validation of large-scale, multi-node GPU clusters. You will focus on high-performance GPU... ...power the world’s most demanding AI and HPC applications. San Francisco, Sunnyvale (Onsite...
Senior
Temporary work
Crusoe
Sunnyvale, CA
2 days ago
