Senior HPC Cluster Engineer
NVIDIA Gruppe
NVIDIA is searching for a highly skilled HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for Electronic Design Automation (EDA) and high-performance computing workloads across multiple teams and projects. The role collaborates with researchers and infrastructure teams to ensure clusters are performant, scalable, and reliable.

What you'll be doing:
- Develop and enhance our ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
- Continuously improve infrastructure provisioning, management, observability, and day-to-day operation through automation.
- Provide technical leadership and strategic guidance for managing large-scale HPC systems, including deployment of compute, networking, and storage.
- Foster strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
- Support researchers to run their EDA workloads, including performance analysis and optimizations.
- Conduct root cause analysis and suggest corrective action; proactively find and fix issues before they occur.
- Build innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.

What we need to see:
- Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
- Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
- Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS or K8s, and applied experience with AI/HPC workflows that use MPI and NCCL.
- Proficiency in using Linux including Rocky/CentOS/RHEL and/or Ubuntu Linux distributions, with a solid understanding of container technologies such as Enroot and Docker.
- Proficiency in Python and Bash.
- Experience analyzing and tuning performance for a variety of EDA workloads, with excellent problem-solving skills to identify bottlenecks and implement scalable solutions.
- Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
- Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure fields.

Ways to stand out from the crowd:
- Background with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking.
- Experience supporting EDA workloads and tools.
- Familiarity with high-speed networking pertaining to HPC, including InfiniBand, RDMA and RoCE.
- Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
- Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.

The base salary will be determined based on location, experience, and the pay of employees in similar positions. Base salary ranges are $152,000 – $241,500 for Level 3 and $184,000 – $287,500 for Level 4. You will also be eligible for equity and benefits.

Applications will be accepted until March 15, 2026.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and is an equal-opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


