Senior AI and ML HPC Cluster Engineer

NVIDIA Gruppe

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of ground-breaking GPU compute clusters that run demanding deep learning, high-performance computing, and other computationally intensive workloads. We seek a technical leader to identify architectural changes and/or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; and capacity modeling and growth planning across our global computing environment.

What You'll Be Doing

Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage.
Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud.
Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs.
Support our researchers in running their workloads, including performance analysis and optimization.
Conduct root cause analysis and suggest corrective action; proactively find and fix issues before they occur.

What We Need to See

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
5+ years of experience designing and operating large-scale compute infrastructure.
Experience with advanced AI/HPC job schedulers, such as Slurm, K8s, PBS, RTDA, or LSF.
Proficiency in administering CentOS/RHEL and/or Ubuntu Linux distributions.
Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt.
In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, and Charliecloud.
Proficiency in Python programming and Bash scripting.
Applied experience with AI/HPC workflows that use MPI.
Experience analyzing and tuning performance for a variety of AI/HPC workloads.
Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Ways to Stand Out from the Crowd

Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
Experience with Machine Learning and Deep Learning concepts, algorithms, and models.
Familiarity with InfiniBand, including IPoIB and RDMA.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Familiarity with deep learning frameworks like PyTorch and TensorFlow.

Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is $152,000‑$241,500 USD for Level 3, and $184,000‑$287,500 USD for Level 4. You will also be eligible for equity and benefits.

EEO Statement

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

Vacancy posted 1 day ago

Similar jobs that could be interesting for you
Based on the Senior AI and ML HPC Cluster Engineer in Santa Clara, CA vacancy

Principal AI/ML Infra Engineer — GPU Clusters & HPC
$272k - $431.25k
NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California, to enhance... ...efficiency of AI/ML research on GPU Clusters. The role involves collaboration with... ...should have extensive experience in HPC systems, programming, and a strong educational...
Suggested
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior Hardware Systems Engineer - AI Rack & Cluster Infrastructure
$131k - $175k
...Senior Hardware Systems Engineer – AI Rack & Cluster Infrastructure Arista Networks is an industry leader in data-driven, client-to-cloud networking for large... ...working directly with hyperscalers or large-scale AI/ML cluster deployments Experience building or...
Senior
Remote work
Flexible hours
Arista Networks, Inc.
Santa Clara, CA
2 days ago
Senior Specialist Field Engineer - HPC/AI/ML
$165k - $220k
...The Essential Cloud for AI™. Built for pioneers by... ...the internal and customer engineering teams, offering valuable... ...the role: As a Senior Specialist Field Engineer... ...offerings, focusing on AI/ML workloads within high-performance compute (HPC) environments Collaborate...
Senior
Permanent employment
Temporary work
Casual work
Work at office
Flexible hours
CoreWeave
Sunnyvale, CA
15 days ago
Senior Math Libraries Engineer - AI and HPC
$184k - $287.5k
...NVIDIA Math Libraries team is looking for a senior engineer to join our development efforts in the area of kernel generation for AI and HPC, specifically targeting matrix operations, JITing and fusions. Around the world, leading commercial and academic organizations are...
Senior
Remote work
NVIDIA
Santa Clara, CA
2 days ago
Senior HPC Cluster Engineer
$152k - $241.5k
...people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU... ...the world. We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA (...
Senior
NVIDIA
Santa Clara, CA
17 hours ago
Senior HPC Cluster Engineer
NVIDIA is searching for a highly skilled HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for Electronic Design Automation... ...management tools such as BCM or Ansible. Experience with AI/HPC job schedulers and orchestrators such as Slurm, LSF, PBS...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior AI/HPC GPU Cluster Architect (Equity)
NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5+ years...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior GPU Cluster Architect for AI/HPC Deployments
$184k - $356.5k
NVIDIA Gruppe is seeking an experienced engineer to lead GPU cluster design and support for AI and HPC deployments in Santa Clara, California. The ideal candidate will have over 8 years of experience with large-scale GPU infrastructure and a strong ability to communicate...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior GPU HPC Cluster Engineer — Equity Eligible
NVIDIA Gruppe seeks a skilled HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for high-performance computing workloads. This role involves collaboration with various teams to ensure effective and reliable cluster performance. Key responsibilities...
Senior
NVIDIA Gruppe
Santa Clara, CA
17 hours ago
Senior Site Reliability Engineer - HPC
$152k - $241.5k
...We’re looking for a Senior SRE to join our... ...harness the power of AI to deliver groundbreaking... ...cleanly with HPC schedulers, storage... ...supporting large‑scale HPC clusters using Slurm, LSF or... ...operations (AIOps/ML‑driven signals)... .... Mentored other engineers and influenced technical...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior GPU Math Library Engineer: AI & HPC Kernel Lead
NVIDIA Gruppe is looking for a senior engineer to join their Math Libraries team in Santa Clara, California... ...has over 8 years of experience in HPC software development using C++, along... ...the opportunity to be part of cutting-edge AI and data center technologies...
Senior
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Principal AI/ML Infra Engineer for GPU Clusters
NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team... ...infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML... ...15 years of experience in AI/ML and HPC, with a deep understanding of relevant...
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior Deep Learning Engineer - Model Evaluation & AI Systems
$224k - $356.5k
...the unlimited potential of AI to define the next era of... ...performance computing. As a Senior / Principal Deep Learning Engineer — Model Evaluation & AI... ...pipelines running on large GPU clusters. Collaborate with and... ..., benchmarks, or ML infrastructure used by other...
Senior
NVIDIA
Santa Clara, CA
1 day ago
Senior Software Engineer, AI Inference Systems
$184k - $287.5k
...highly skilled and motivated software engineers to join us and build AI inference systems that serve large-... ...large-scale inference deployments on GPU clusters across clouds. Conduct and publish... ...the pareto frontier for the field of ML Systems; survey recent publications...
Senior
NVIDIA
Santa Clara, CA
1 day ago
Senior Software Engineer, Generative AI Systems
$152k - $241.5k
...seeking a highly motivated Software Engineer to join our growing AI and Generative AI engineering team. In... ...scalable infrastructure for large‑scale ML training, inference, and generative... ...‑native platforms supporting GPU clusters, fault‑tolerant training, and high‑performance...
Senior
NVIDIA Gruppe
Santa Clara, CA
3 days ago
Principal AI and ML Infra Software Engineer, GPU Clusters
$272k - $431.25k
...Principal AI and ML Infra Software Engineer, GPU Clusters We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our... ...15+ years of demonstrated expertise in AI/ML and HPC tasks and systems. Hands-on experience in using or...
NVIDIA
Santa Clara, CA
3 days ago
Senior HPC Storage Engineer
$184k - $287.5k
...into the unlimited potential of AI to define the next era of... ...distributed storage services for HPC workloads, optimizing both performance... ...to run their flows on our clusters including performance analysis... ...Computer Science, Electrical Engineering or related field or equivalent...
Senior
NVIDIA
Santa Clara, CA
4 days ago
Senior GPU Compiler Engineer — Hybrid, AI/ML Performance
Intel Corporation is seeking a Senior Compiler Engineer to develop and optimize compiler software for next-generation GPU architectures. The role... ...collaborating on cutting-edge compiler technologies that enhance AI and high-performance computing performance. The ideal...
Senior
Intel Corporation
Santa Clara, CA
1 day ago
Senior GPU Performance Engineer for AI Acceleration
$207k - $300k
Google is seeking an experienced AI/ML Software Engineer to enhance GPU architectures and optimize performance benchmarks. The role involves collaborating with teams to solve ML model challenges and architect transformative AI solutions, contributing to Google's machine...
Senior
Google
Sunnyvale, CA
3 days ago
Senior Staff Engineer, Voice AI & Conversational Systems
$180k - $200k
Uber is hiring a Senior Staff Engineer to architect and scale an autonomous support agent, enhancing customer experience using GenAI tools. The... ...have over 10 years of experience in building production ML/AI systems and will lead voice agent initiatives. This role offers...
Senior
Uber
Sunnyvale, CA
3 days ago
Senior Systems Engineer - AI Infrastructure
$150k - $230k
...Senior Systems Engineer - AI Infrastructure On Site, Palo Alto, California About the Role We're... ...implement low-level systems software for GPU clusters Work with internals of frameworks... ...networking (RDMA, InfiniBand) ML framework or runtime internals Cluster...
Senior
Clockwork Systems
Palo Alto, CA
17 hours ago
Senior ASIC Power Delivery Engineer for AI/ML TPU
$163k - $237k
Google Inc. is seeking an experienced candidate to shape the future of AI/ML hardware acceleration, focusing on TPU technology that powers demanding AI applications. You'll drive innovations by optimizing and verifying power delivery networks, ensuring reliability and performance...
Senior
Google Inc.
Sunnyvale, CA
2 days ago
Senior Analog Design Engineer, High-Speed AI Silicon
$240k - $334k
Google is seeking an Analog Design Engineer in Sunnyvale, California to shape the future of AI/ML hardware with advanced TPU technologies. You will define technical specifications and influence strategy to ensure innovative silicon solutions for deep learning applications...
Senior
Google
Sunnyvale, CA
2 days ago
Senior ML Systems Engineer — End-to-End AI Bring-Up
A leading AI technology company located in Sunnyvale, California, is looking for an experienced engineer to join its SOTA Training Platform team. The ideal candidate will have over... ...frameworks. Responsibilities include bringing ML models to life on Cerebras CSX systems,...
Senior
Cerebras
Sunnyvale, CA
4 days ago
Senior Power Analysis and Optimization Engineer, AI-LLM Systems
$136k - $218.5k
...We’re looking for a Senior Power Architecture & Optimization Engineer to push the limits of energy efficiency using advanced analytics and AI, including LLMs trained specifically for power analysis... ...‑aware models and flows, including ML/RL‑based techniques for anomaly...
Senior
NVIDIA
Santa Clara, CA
2 days ago
Senior Developer Technology Engineer - AI
We’re currently seeking a Senior Developer Technology Engineer, Artificial Intelligence! Would you enjoy researching... ...parallel algorithms to accelerate AI workloads on advanced computer... ...analysis and optimization of complex AI and HPC algorithms to ensure the best...
Senior
Work experience placement
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior AI and HPC Observability Engineer
$152k - $241.5k
...powers everything from generative AI to autonomous systems, and we continue... ...tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining us, you’ll help... .... We are looking for a strong AI & HPC Observability Engineer to build and...
Senior
NVIDIA
Santa Clara, CA
4 days ago
Senior Cloud Automation Engineer - AI & ML Ops
...CrowdStrike, Inc. is seeking a Cloud Software Engineer to join the Falcon Complete AI Engineering Team in Sunnyvale, California. In this role, you will design, build, and deploy distributed cloud ecosystems using technologies such as Golang and Python. The ideal candidate...
Senior
CrowdStrike
Sunnyvale, CA
4 days ago
Senior Cloud-Native Systems Engineer, AI Datacenters
A leading technology company based in Santa Clara, California, is seeking a Senior Software Engineer to focus on the cloud-native stack for their AI/ML datacenters. This role entails deep technical work including debugging complex systems and gathering customer requirements...
Senior
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior ML Test Engineer - AI Compute & CI/CD Lead
$180k - $300k
MixMode is seeking a Principal Software ML Test Engineer to lead testing for the d-Matrix AI compute engine in Santa Clara, California. This role involves overseeing test planning, automation, and execution, while collaborating closely with software development teams....
Senior
MixMode
Santa Clara, CA
17 hours ago
