Principal AI and ML Infra Software Engineer, GPU Clusters

$272k - $431.25k

NVIDIA Gruppe

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. In this role, you will play a pivotal part in making our researchers more efficient by driving improvements across the entire stack. Your main task will be to work closely with customers to identify and fix infrastructure gaps, enabling groundbreaking AI and ML research on GPU clusters. Together, we can build powerful, efficient, and scalable solutions as we shape the future of AI/ML technology!

Responsibilities

Engage closely with our AI and ML research teams to understand their infrastructure requirements and barriers, converting those insights into actionable improvements.
Proactively identify bottlenecks in researcher efficiency and lead initiatives to address them systematically. Drive the direction and long‑term roadmaps for such initiatives.
Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
Help define and improve key measures of AI researcher efficiency, ensuring that our actions are tied to measurable results.
Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
Keep up to date with the most recent developments in AI/ML technologies, frameworks, and best practices, and advocate for their adoption within the organization.

Qualifications

BS or similar background in Computer Science or a related area (or equivalent experience).
15+ years of demonstrated expertise in AI/ML and HPC workloads and systems.
Hands‑on experience using or operating High Performance Computing (HPC) grade infrastructure, as well as in‑depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high‑speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
Ability to run and optimize large-scale distributed training jobs using PyTorch (DDP, FSDP), NeMo, or JAX.
In‑depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines.
Proficiency in programming & scripting languages such as Python, Go, and Bash, familiarity with cloud computing platforms (e.g., AWS, GCP, Azure), and experience with parallel computing frameworks and paradigms.
Dedication to ongoing learning and staying current with new technologies and innovative methods in the AI/ML infrastructure sector.
Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

Benefits

NVIDIA offers competitive salaries and a comprehensive benefits package. Our engineering teams are growing rapidly as the company expands. If you're a passionate and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 1, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 4 days ago

Similar jobs that could be interesting for you
Based on the Principal AI and ML Infra Software Engineer, GPU Clusters in Santa Clara, CA vacancy

Principal AI/ML Infra Engineer for GPU Clusters
...NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In this role, you... ...efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The...
Principal
Jobleads-US
Santa Clara, CA
4 days ago
Principal AI/ML Infra Engineer — GPU Clusters & HPC
$272k - $431.25k
NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California, to enhance the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with various teams, monitoring infrastructure performance, and implementing improvements...
Principal
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior AI Infra Engineer-Distributed GPU Clusters (Equity)
NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to lead the optimization of distributed training across large-scale GPU platforms. Candidates should have substantial experience in AI applications and technical leadership. This role involves profiling...
Suggested
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Principal AI/ML Engineer, AV ML Infra
$275.8k - $340.5k
...Position Overview The Principal AI/ML Engineer will lead a growing organization, guiding the AV ML Infra team in achieving its mission while shaping long‑term vision and execution strategies across GM’s AI and ML efforts. This leadership role will drive a transformative...
Principal
Local area
Remote work
Relocation
Relocation package
Flexible hours
General Motors
Sunnyvale, CA
1 day ago
Principal AI/ML Engineer, AV ML Infra
$275.8k - $340.5k
...About the team: The AV ML Infra team at GM builds ML infrastructure... ...meet the unique demands of AI and ML innovation, supporting... ...the productivity of ML engineers, and drive the adoption of cutting... ...Position Overview: The Principal AI/ML Engineer will lead a growing...
Principal
Local area
Remote work
Work from home
Relocation
Relocation package
Flexible hours
General Motors
Sunnyvale, CA
5 days ago
Principal AI Performance Architect for Scalable GPU Training
Advanced Micro Devices is looking for a Principal Engineer in Santa Clara, CA to lead AI infrastructure development, define GPU architecture specifications, and drive performance gains in ML systems. The role involves leading innovative techniques, collaborating with stakeholders...
Principal
Advanced Micro Devices, Inc.
Santa Clara, CA
2 days ago
Principal AI/ML Engineer, AV ML Infra
$275.8k - $340.5k
...About the Team The AV ML Infra team at GM builds ML infrastructure... ...to meet the unique demands of AI and ML innovation, supporting... ...enhance the productivity of ML engineers, and drive the adoption of... ...techniques. Position Overview The Principal AI/ML Engineer will lead a...
Principal
Remote work
Relocation
Relocation package
Flexible hours
General Motors
Mountain View, CA
5 days ago
Principal Tech Lead Manager - Embodied AI Evaluation Foundations
$296.3k
...Foundations team in Embodied AI and is responsible for... ...high‑impact team of AI/ML engineers, data scientists and... ...vehicles. Role As a Principal Engineer in the Embodied... ...pipelines on modern cloud / GPU infrastructure, with... .../Mining/Quality and Infra Foundations to turn evaluation...
Principal
Local area
Flexible hours
General Motors
Sunnyvale, CA
4 days ago
Senior Security Software Engineer - AI Infra & Clusters (Equity)
NVIDIA Gruppe in Santa Clara, California is seeking a Senior Software Engineer for their security team. This role involves developing and enforcing... ...across cutting-edge computing environments, including AI infrastructures. The ideal candidate will have strong experience...
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Senior AI Compiler Engineer — GPU Optimization
$152k - $241.5k
NVIDIA seeks an experienced engineer for AI-based GPU compiler technology in Santa Clara, California. The role involves designing technology for GPU... ...candidate holds an M.S. or Ph.D., has over 5 years in AI/ML, and skills in Python and C++. Competitive salaries range from...
NVIDIA
Santa Clara, CA
4 days ago
Senior AI Infrastructure Engineer, Large-Scale GPU Clusters
NVIDIA Corporation in Santa Clara is seeking a Senior Software Engineer to lead the optimization of large-scale AI systems. This role will involve profiling and... ...Responsibilities include leading the debugging process of multi-GPU environments and mentoring less experienced...
NVIDIA Corporation
Santa Clara, CA
2 days ago
Senior Staff Software Development Engineer- GPU/AI/ML
...generation computing experiences—from AI and data centers, to PCs,... ...is looking for an influential software engineer who is passionate about... ...performance from the lowest-level GPU kernels to large-scale distributed... ..., or the C++/HIP/CUDA core of ML frameworks like PyTorch,...
Advanced Micro Devices, Inc.
Santa Clara, CA
5 days ago
Senior Software Engineer, GenAI & AI/ML Cloud Infra
A leading technology company in California is looking for a Senior Software Engineer to develop cutting-edge AI and ML solutions. Responsibilities include writing and testing code, collaborating through design and code reviews, and contributing to documentation. Candidates...
Full time
Google Inc.
Sunnyvale, CA
5 days ago
Senior Software Engineer, Generative AI Systems
$152k - $241.5k
...NVIDIA is seeking a highly motivated Software Engineer to join our growing AI and Generative AI engineering team.... ...infrastructure for large‑scale ML training, inference, and generative... ...cloud‑native platforms supporting GPU clusters, fault‑tolerant training, and high‑...
NVIDIA Gruppe
Santa Clara, CA
1 day ago
Senior Software Engineer, AI Inference Systems
$184k - $287.5k
...highly skilled and motivated software engineers to join us and build AI inference systems that... ...inference stacks, optimize GPU kernels and compilers, drive... ...deployments on GPU clusters across clouds. Conduct and... ...frontier for the field of ML Systems; survey recent publications...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Principal AI Agent / ML Software Engineer (OCI)
...Senior Principal AI Agent / ML Software Engineer The Senior Principal AI Agent / ML Software Engineer is a Senior Staff-level, hands-on technical leadership... ...services optimized for low latency, high throughput, GPU efficiency, reliability, cost, operability, and secure...
Principal
Flexible hours
Oracle
Santa Clara, CA
1 day ago
Senior ML Platform Engineer Scale GPU AI Infra (Remote, Equity)
$152k - $241.5k
NVIDIA Corporation is seeking a Senior ML Platform Engineer to design and scale high-performance ML infrastructure. You'll utilize IaC techniques with Ansible and Terraform, collaborating closely with ML researchers and ensuring system reliability and performance. This...
Remote job
NVIDIA
Santa Clara, CA
4 days ago
Senior AI/ML Engineer - AV Infra
$170.6k - $261.3k
...Job Description Senior AI/ML Engineer, AV ML Infra We're General Motors (GM), a company driving the future of mobility with advanced self-driving and electric vehicle technologies. We're building the world's most innovative autonomous vehicles to safely connect...
Local area
Work from home
Flexible hours
General Motors
Sunnyvale, CA
2 days ago
Principal AI/ML Infra Leader (Remote)
General Motors is hiring a Principal AI/ML Engineer to lead the AV ML Infra team in Mountain View, California. This leadership role involves shaping the vision and execution of AI and ML infrastructure, driving transformative projects, and mentoring engineers. Key qualifications...
Principal
Remote job
Local area
General Motors
Mountain View, CA
4 days ago
Principal Network Automation Architect - AI-Driven Infra
NVIDIA Gruppe in Santa Clara is seeking a Principal AI/ML Engineer to lead the development of automated network platforms crucial for our cloud and datacenter operations. This role requires a deep expertise in Python and Golang, an understanding of networking fundamentals...
Principal
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Backend Engineer AI Infra & GPU Scheduling (Hybrid)
A tech company in Palo Alto is actively seeking a Backend Engineer to develop systems that manage GPU clusters for AI workloads. The role demands 5+ years of backend experience, with strong skills in Go or Python. You will design APIs for GPU orchestration and manage resources...
SproutsAI
Palo Alto, CA
3 days ago
Principal Software Engineer - Large-Scale LLM Memory and Storage Systems
$272k - $425.5k
Principal Software Engineer – Large-Scale LLM Memory and Storage Systems... ...for serving generative AI and reasoning models... ..., Dynamo orchestrates GPU shards, routes requests... ...across heterogeneous clusters so that many accelerators... ...storage, or ML systems infrastructure...
Principal
Local area
Remote work
NVIDIA Corporation
Santa Clara, CA
2 days ago
Senior AI/HPC GPU Cluster Architect (Equity)
NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5+ years...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior GPU Cluster Architect for AI/HPC Deployments
$184k - $356.5k
NVIDIA Gruppe is seeking an experienced engineer to lead GPU cluster design and support for AI and HPC deployments in Santa Clara, California. The ideal candidate will have over 8 years of experience with large-scale GPU infrastructure and a strong ability to communicate...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior AI Application Developer - GPU and SOC Architecture Modeling
$152k - $241.5k
...computing model focused on visual and AI computing. For two decades,... ..., with our invention of the GPU. The GPU has also shown to be... ...is looking for Architects, Software Engineers, and AI application developers... ...Proficiency in C++, Python and ML frameworks like LangChain, LangSmith...
Full time
NVIDIA
Santa Clara, CA
20 hours ago
Principal AI Engineer: Scalable LLM Agents for GTM
$167k - $270.5k
Palo Alto Networks, Inc. is seeking a Technical Leader to develop AI applications within the GTM/CX domain. This role involves defining the architecture for scalable AI/ML systems and leading the design of intelligent agents. Ideal candidates will have 15+ years of experience...
Principal
Palo Alto Networks, Inc.
Santa Clara, CA
1 day ago
Senior Principal Software Engineer — AI/ML Cloud-Native
Gigamon, located in Santa Clara, CA, is seeking a Senior Principal Software Engineer to lead the design and development of AI/ML-driven applications for network monitoring and security. The role requires strong programming expertise in Java and experience in building scalable...
Principal
Gigamon
Santa Clara, CA
1 day ago
Principal AI Inference Engineer Open-Source & GPU-Focused
$272k - $431.25k
NVIDIA Gruppe is looking for a Principal Software Engineer to advance open-source AI inference. This hands-on role emphasizes running high-performance inference on NVIDIA platforms and involves collaboration across various teams. Key responsibilities include optimizing...
Principal
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Software Engineer, AI Performance Analysis
$168k - $270.25k
Overview NVIDIA GPU Architecture Group is seeking a senior software engineer to automate and optimize performance analysis workflows for AI training and inference workloads. You will not only perform... ...developer tools or platforms for ML engineers Contributions to open-...
Work experience placement
NVIDIA
Santa Clara, CA
4 days ago
Senior Software Engineer, AI and DL Kernel Libraries
$184k - $287.5k
We're looking for outstanding AI systems engineers to develop groundbreaking... ...technologies in the inference systems software stack! We build innovative AI... ..., code generators, and GPU kernel technologies for NVIDIA... .../ industry) experience with ML/DL systems development preferable...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
