
Senior HPC Cluster Engineer

NVIDIA Gruppe

NVIDIA is searching for a highly skilled HPC Cluster Engineer to design, deploy, and operate GPU compute clusters for Electronic Design Automation and high-performance computing workloads across multiple teams and projects. The role collaborates with researchers and infrastructure teams to ensure clusters are performant, scalable, and reliable.

What you'll be doing:
  • Develop and enhance our ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
  • Continuously improve infrastructure provisioning, management, observability, and day-to-day operation through automation.
  • Provide technical leadership and strategic guidance for managing large-scale HPC systems, including deployment of compute, networking, and storage.
  • Foster strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
  • Support researchers to run their EDA workloads, including performance analysis and optimizations.
  • Conduct root cause analysis and suggest corrective action; proactively find and fix issues before they occur.
  • Build innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.

What we need to see:
  • Bachelor’s degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
  • Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
  • Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS or K8s, and applied experience with AI/HPC workflows that use MPI and NCCL.
  • Proficiency in using Linux, including Rocky/CentOS/RHEL and/or Ubuntu Linux distributions, with a solid understanding of container technologies such as Enroot and Docker.
  • Proficiency in Python and Bash.
  • Experience analyzing and tuning performance for a variety of EDA workloads, with excellent problem-solving skills to identify bottlenecks and implement scalable solutions.
  • Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
  • Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure fields.

Ways to stand out from the crowd:
  • Background with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking.
  • Experience supporting EDA workloads and tools.
  • Familiarity with high-speed networking pertaining to HPC, including InfiniBand, RDMA and RoCE.
  • Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
  • Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.

The base salary will be determined based on location, experience, and the pay of employees in similar positions. Base salary ranges are $152,000 – $241,500 for Level 3 and $184,000 – $287,500 for Level 4. You will also be eligible for equity and benefits.

Applications will be accepted until March 15, 2026.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and is an equal-opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 4 days ago
Similar jobs that could be interesting for you
Based on the Senior HPC Cluster Engineer in Santa Clara, CA vacancy
  • $152k - $241.5k

     ...Come join the team and see how you can make a lasting impact on the world. We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA (Electronic Design Automation) and high-performance computing... 
    Senior

    NVIDIA

    Santa Clara, CA
    3 days ago
  • $184k - $287.5k

     ...next-gen distributed storage services for HPC workloads, optimizing both performance...  ...our researchers to run their flows on our clusters including performance analysis and...  ...degree in Computer Science, Electrical Engineering or related field or equivalent experience... 
    Senior

    NVIDIA

    Santa Clara, CA
    2 days ago
  • $140k - $160k

     ...ASRC Federal is looking for a Senior HPC Engineer, as ASRC Federal InuTeq provides High Performance Computing services across the full HPC...  ...Key Responsibilities: Design, deploy and maintain HPC clusters with over 2000+ nodes with InfiniBand, 100+ petabytes of data... 
    Senior
    Contract work
    Weekend work

    ASRC Federal Holding Company

    Mountain View, CA
    3 days ago
  • NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5+ years... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $152k - $241.5k

     ...Overview We’re looking for a Senior SRE to join our Compute Farm team...  ...they integrate cleanly with HPC schedulers, storage, and network...  ...supporting large‑scale HPC clusters using Slurm, LSF or Kubernetes...  ...Perl, or Ruby. Mentored other engineers and influenced technical direction... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $184k - $356.5k

    NVIDIA Gruppe is seeking an experienced engineer to lead GPU cluster design and support for AI and HPC deployments in Santa Clara, California. The ideal candidate will have over 8 years of experience with large-scale GPU infrastructure and a strong ability to communicate... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $184k - $287.5k

     ...NVIDIA Math Libraries team is looking for a senior engineer to join our development efforts in the area of kernel generation for AI and HPC, specifically targeting matrix operations, JITing and fusions. Around the world, leading commercial and academic organizations are... 
    Senior
    Remote work

    NVIDIA

    Santa Clara, CA
    19 hours ago
  • $152k - $287.5k

    NVIDIA Corporation is seeking a motivated Performance Engineer to enhance the roadmap of communication libraries. In this role, you will conduct in-depth performance characterization on multi-GPU clusters and analyze the interaction of libraries with hardware and software... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • NVIDIA Gruppe is looking for a senior engineer to join their Math Libraries team in Santa Clara, California. This role involves designing and...  .... The ideal candidate has over 8 years of experience in HPC software development using C++, along with leadership skills and... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $124k - $195.5k

    NVIDIA Gruppe in Santa Clara seeks an HPC Operations Engineer to design and implement compute clusters for silicon development. Ideal candidates will have experience troubleshooting in large-scale environments and enhancing deployment automation. Applicants should be proficient... 

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $152k - $241.5k

     ...environment remains resilient, measurable, and aligned with long-term engineering demands. What you'll be doing: Manage, scale, and...  ...supporting and tuning job scheduling systems (LSF, Slurm, etc.) in HPC or silicon design environments Proficiency in Linux systems... 
    Senior

    NVIDIA

    Santa Clara, CA
    3 days ago
  • As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the...  ...of ground-breaking GPU compute clusters that run demanding deep learning, high-performance...  ...degree in Computer Science, Electrical Engineering or related field or equivalent... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $152k - $241.5k

    NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to enhance their HPC infrastructure. The role involves applying distributed systems patterns, automation, and building scalable services in a hybrid multi-cloud environment. Candidates should have strong... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $131k - $175k

     ...Senior Hardware Systems Engineer – AI Rack & Cluster Infrastructure Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus and routing environments. What sets us apart is our relentless pursuit of innovation.... 
    Senior
    Remote work
    Flexible hours

    Arista Networks, Inc.

    Santa Clara, CA
    19 hours ago
  • A pioneering technology firm in Sunnyvale, CA is seeking an ASIC Design Verification Engineer to ensure the functional correctness of high-speed low-power digital integrated circuits. The ideal candidate will have significant experience in ASIC verification, particularly... 
    Senior

    Avicena Inc.

    Sunnyvale, CA
    3 days ago
  • $200k - $400k

    A dedicated research lab is seeking a Network Engineer to design and optimize low-latency, high-bandwidth networking solutions for AI supercomputing clusters. You will work on cutting-edge technologies in collaboration with world-class researchers. The ideal candidate has... 
    Senior

    Institute of Foundation Models

    Sunnyvale, CA
    3 days ago
  • KLA, located in Milpitas, California, is seeking an HPC Systems Architect to design and support HPC clusters vital for IC fabs and mask shops globally. The ideal candidate will have deep expertise in Linux systems and virtualization technologies, and will drive innovative... 

    KLA

    Milpitas, CA
    4 days ago
  • $165k - $220k

     ...aligns closely with the internal and customer engineering teams, offering valuable insights from...  ...development. About the role: As a Senior Specialist Field Engineer CoreWeave, you...  ...within high-performance compute (HPC) environments Collaborate closely with... 
    Senior
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    13 days ago
  • $200k - $400k

     ...Institute Of Foundation Models Engineer The Institute of Foundation Models (IFM) designs...  ...tolerant distributed execution under real-world cluster failures Core Technical Scope ·...  ...links to relevant distributed systems, HPC, or large-scale training projects · Include... 
    Senior
    Visa sponsorship

    Institute of Foundation Models

    Sunnyvale, CA
    2 days ago
  • $277k - $358k

     ...Job Description Senior Director, CTIO Engineering Technologists From applied research to advanced engineering, the Engineering...  ...of their organization related to ~ HPC (high-performance compute) clusters, AI compute, AI Datacenter, AI Storage etc. ~... 
    Senior

    Dell

    Santa Clara, CA
    3 days ago
  • $152k - $241.5k

    NVIDIA Corporation is seeking a highly motivated Senior Software Engineer for its communication libraries and network software team in Santa Clara, California. This role involves designing and maintaining communication runtimes for Deep Learning frameworks and participating... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • $184k - $287.5k

     ...datacenter provisioning and management. As a Senior Software Engineer - Datacenter Systems, you will work...  ...technology supporting large-scale GPU clusters connected through NVLink and InfiniBand. These clusters run today's fastest HPC and AI workloads. This role suits... 
    Senior
    Remote work

    NVIDIA

    Santa Clara, CA
    3 days ago
  •  ...depth AI workload performance characterization on multi‑GPU clusters. Design fault‑tolerant and elastic solutions for large‑scale...  ...or related field (or equivalent experience) with 5+ software engineering and HPC/AI experience Development or integration experience with... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • NVIDIA Gruppe in Santa Clara is hiring for a role in their Hardware Infrastructure EDA Compute team to optimize workload scheduling systems and improve overall service reliability. The successful candidate will manage and scale job scheduling systems while driving measurable...
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $150k - $300k

     ...performance, and availability of large-scale GPU clusters. • Respond to incidents and perform...  ...'s degree in Computer Science, Computer Engineering, Software Engineering, Information...  ...administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations. •... 
    Night shift

    Institute of Foundation Models

    Sunnyvale, CA
    2 days ago
  • $165k - $242k

     ...HPC Performance Engineer CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology...  ...bare-metal systems from POST through joining a Kubernetes cluster. The team's primary responsibilities include maintaining a... 
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    3 days ago
  • $152k - $241.5k

     ...impact on the world. We are looking for a Senior Software Engineer to join our mission to continue improving our HPC infrastructure. Our team builds and operates sophisticated...  ...core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems... 
    Senior

    NVIDIA

    Santa Clara, CA
    19 hours ago
  • $140k - $224.25k

     ...markets include gaming, automotive, vision, HPC, datacenters and networking in addition...  ...with various telemetries, scale out cluster, test plan development, track record in developing...  ...) in a STEM (Science, Technology, Engineering, Math or Physics) field ~5+ years proven... 
    Senior

    NVIDIA

    Santa Clara, CA
    2 days ago
  • NVIDIA Gruppe in Santa Clara, California is seeking a skilled HPC/AI Benchmarking and Telemetry Engineer to join their team. In this role, you will develop benchmarking approaches for large-scale HPC and AI clusters, create telemetry frameworks to capture performance data,... 

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $172.5k - $210k

     ...the Role: As a Virtualization Validation Engineer , you will be responsible for the end-to-...  ...validation of large-scale, multi-node GPU clusters. You will focus on high-performance GPU...  ...power the world’s most demanding AI and HPC applications. San Francisco, Sunnyvale (Onsite... 
    Senior
    Temporary work

    Crusoe

    Sunnyvale, CA
    2 days ago
