Principal AI and ML Infra Software Engineer, GPU Clusters
$272k - $431.25k
NVIDIA
We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will play a pivotal role in enhancing efficiency for our researchers by implementing improvements throughout the entire stack. Your main task will be to collaborate closely with customers to pinpoint and address infrastructure deficiencies, facilitating groundbreaking AI and ML research on GPU Clusters. Together, we can craft powerful, effective, and scalable solutions as we shape the future of AI/ML technology!

**What you will be doing:**

* Engage closely with our AI and ML research teams to understand their infrastructure requirements and barriers, converting those insights into actionable improvements.
* Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically remove them. Drive the direction and long-term roadmaps for such initiatives.
* Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
* Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
* Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
* Keep up to date with the most recent developments in AI/ML technologies, frameworks, and best practices, and advocate for their integration within the organization.

**What we need to see:**

* BS or similar background in Computer Science or a related area (or equivalent experience).
* 15+ years of demonstrated expertise in AI/ML and HPC workloads and systems.
* Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure, as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
* Ability to run and improve large-scale distributed training jobs using PyTorch (DDP, FSDP), NeMo, or JAX, along with an in-depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines.
* Proficiency in programming & scripting languages such as Python, Go, and Bash, familiarity with cloud computing platforms (e.g., AWS, GCP, Azure), and experience with parallel computing frameworks and paradigms.
* Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector.
* Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

NVIDIA offers competitive salaries and a comprehensive benefits package. Our engineering teams are expanding rapidly. If you're a passionate and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 1, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.