Principal AI and ML Infra Software Engineer, GPU Clusters

$272k - $431.25k

NVIDIA

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will play a pivotal role in improving our researchers' efficiency by implementing improvements across the entire stack. Your main task will be to collaborate closely with customers to pinpoint and address infrastructure deficiencies, enabling groundbreaking AI and ML research on GPU clusters. Together, we can craft powerful, effective, and scalable solutions as we shape the future of AI/ML technology!

**What you will be doing:**

* Engage closely with our AI and ML research teams to understand their infrastructure requirements and obstacles, converting those insights into actionable improvements.
* Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically address them. Drive the direction and long-term roadmaps for such initiatives.
* Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
* Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
* Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
* Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.

**What we need to see:**

* BS or similar background in Computer Science or a related area (or equivalent experience).
* 15+ years of demonstrated expertise in AI/ML and HPC workloads and systems.
* Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure, as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (e.g., Docker, Enroot).
* Experience supervising and improving large distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX, along with an in-depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines.
* Proficiency in programming & scripting languages such as Python, Go, and Bash, familiarity with cloud computing platforms (e.g., AWS, GCP, Azure), and experience with parallel computing frameworks and paradigms.
* Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector.
* Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

NVIDIA offers competitive salaries and a comprehensive benefits package. Our engineering teams are growing rapidly. If you're a passionate and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 1, 2026. This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 11 hours ago
