Senior AI and ML HPC Cluster Engineer
NVIDIA Gruppe
As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of ground-breaking GPU compute clusters that run demanding deep learning, high-performance computing, and other computationally intensive workloads. We seek a technical leader to identify architectural changes and/or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What You'll Be Doing
- Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage.
- Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
- Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud.
- Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs.
- Support our researchers in running their workloads, including performance analysis and optimization.
- Conduct root cause analysis and suggest corrective action; proactively find and fix issues before they occur.

What We Need to See
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- 5+ years of experience designing and operating large-scale compute infrastructure.
- Experience with advanced AI/HPC job schedulers such as Slurm, K8s, PBS, RTDA, or LSF.
- Proficiency in administering CentOS/RHEL and/or Ubuntu Linux distributions.
- Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt.
- In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud.
- Proficiency in Python programming and Bash scripting.
- Applied experience with AI/HPC workflows that use MPI.
- Experience analyzing and tuning performance for a variety of AI/HPC workloads.
- Passion for continual learning and for staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Ways to Stand Out from the Crowd
- Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
- Experience with machine learning and deep learning concepts, algorithms, and models.
- Familiarity with InfiniBand, including IPoIB and RDMA.
- Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
- Familiarity with deep learning frameworks such as PyTorch and TensorFlow.

Benefits
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is $152,000‑$241,500 USD for Level 3 and $184,000‑$287,500 USD for Level 4. You will also be eligible for equity and benefits.

EEO Statement
NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.


