
Principal AI and ML Infra Software Engineer, GPU Clusters

$272k - $431.25k

NVIDIA

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will have a pivotal role in enhancing efficiency for our researchers by implementing improvements throughout the entire stack. Your main task will revolve around collaborating closely with customers to pinpoint and address infrastructure deficiencies, facilitating groundbreaking AI and ML research on GPU Clusters. Together, we can craft potent, effective, and scalable solutions as we shape the future of AI/ML technology!

**What you will be doing:**

* Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.
* Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically remove them. Drive the direction and long-term roadmaps for such initiatives.
* Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
* Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
* Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
* Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.

**What we need to see:**

* BS or similar background in Computer Science or related area (or equivalent experience).
* 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
* Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure, as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
* Capability in supervising and improving substantial distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX, together with an in-depth understanding of AI/ML workflows, involving data processing, model training, and inference pipelines.
* Proficiency in programming & scripting languages such as Python, Go, and Bash, familiarity with cloud computing platforms (e.g., AWS, GCP, Azure), and experience with parallel computing frameworks and paradigms.
* Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector.
* Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

NVIDIA offers competitive salaries and a comprehensive benefits package. Our engineering teams are growing rapidly due to outstanding expansion. If you're a passionate and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 1, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 11 hours ago
Similar jobs that could be interesting for you
Based on the Principal AI and ML Infra Software Engineer, GPU Clusters in Santa Clara, CA vacancy
  • $272k - $431.25k

     ...We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will have a pivotal role in enhancing efficiency for our researchers by implementing progressions throughout the entire stack... 
    Principal

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  •  ...NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In this role, you...  ...efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The ideal... 
    Principal

    NVIDIA Gruppe

    Santa Clara, CA
    11 hours ago
  • $272k - $431.25k

     ...NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California, to enhance the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with various teams, monitoring infrastructure performance, and implementing improvements... 
    Principal

    NVIDIA

    Santa Clara, CA
    11 hours ago
  • $152k - $241.5k

     ...Visualization. Our invention—the GPU—functions as the visual...  ...from generative AI to autonomous vehicles....  ...looking for a Senior Software Engineer to help accelerate the...  ...performance‑optimal GPU clusters to internal researchers...  ...the most advanced ML models on some of the world... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    12 hours ago
  •  ...computing experiences-from AI and data centers, to...  ...a Senior Staff AI Infra Engineer who is passionate...  ...special focus on AI/ML workloads and GPU-accelerated computing...  ...intersection of hardware and software to optimize...  ...AI workloads on GPU clusters, including large-scale... 
    Principal

    Advanced Micro Devices, Inc.

    Santa Clara, CA
    5 days ago
  • $275.8k - $340.5k

     ...Position Overview The Principal AI/ML Engineer will lead a growing organization, guiding the AV ML Infra team in achieving its mission while shaping long‑term vision and execution strategies across GM’s AI and ML efforts. This leadership role will drive a transformative... 
    Principal
    Local area
    Remote work
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    10 hours ago
  • $152k - $287.5k

    A leading technology company is seeking a Senior Software Engineer to develop solutions for GPU clusters aimed at enhancing machine learning innovation. The ideal...  ...engineering with significant involvement in ML infrastructure, strong coding skills in Python, C++,... 

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • $275.8k - $340.5k

     ...About the team: The AV ML Infra team at GM builds ML infrastructure...  ...meet the unique demands of AI and ML innovation, supporting...  ...the productivity of ML engineers, and drive the adoption of cutting...  ...Position Overview: The Principal AI/ML Engineer will lead a growing... 
    Principal
    Local area
    Remote work
    Work from home
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    5 days ago
  • $212.8k - $387.6k

     ...years experience as a senior development engineer. 3. 3 years experience of building large-scale...  ...understanding of cloud infrastructure or AI infrastructure. 6. Familiar with at least one of the areas below: GPU Infra (GPU cluster management, job scheduling, collective... 
    Temporary work
    Local area

    ByteDance

    San Jose, CA
    4 days ago
  • $140k - $252k

     ...What to Expect As a Software Engineer within the Supercomputing AI Infrastructure team, you will...  ...optimizing our training compute clusters at the core of Robotaxi...  ...currently scaling 100K+ GPU clusters, which are...  ...Work closely with the ML team to understand workload... 
    Hourly pay
    Full time
    Temporary work
    Flexible hours

    Tesla

    Palo Alto, CA
    1 day ago
  • Advanced Micro Devices is looking for a Principal Engineer in Santa Clara, CA to lead AI infrastructure development, define GPU architecture specifications, and drive performance gains in ML systems. The role involves leading innovative techniques, collaborating with stakeholders... 
    Principal

    Advanced Micro Devices

    Santa Clara, CA
    2 days ago
  • $275.8k - $340.5k

     ...About the Team The AV ML Infra team at GM builds ML infrastructure...  ...to meet the unique demands of AI and ML innovation, supporting...  ...enhance the productivity of ML engineers, and drive the adoption of...  ...techniques. Position Overview The Principal AI/ML Engineer will lead a... 
    Principal
    Remote work
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Mountain View, CA
    11 hours ago
  • $152k - $287.5k

     ...NVIDIA Gruppe, based in Santa Clara, is seeking a Senior Software Engineer to accelerate the development of machine learning innovations. In this role, you'll design and implement solutions for GPU clusters, enabling researchers to optimize their work. Strong expertise... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • $296.3k

     ...Foundations team in Embodied AI and is responsible for...  ...high‑impact team of AI/ML engineers, data scientists and...  ...vehicles. Role As a Principal Engineer in the Embodied...  ...pipelines on modern cloud / GPU infrastructure, with...  .../Mining/Quality and Infra Foundations to turn evaluation... 
    Principal
    Local area
    Flexible hours

    General Motors

    Sunnyvale, CA
    4 days ago
  •  ...Tensec is seeking an experienced Platform Engineer to develop and operate a hybrid infrastructure for AI/ML research and product development. You will architect and...  ...enhancing the platform’s performance for high-demand GPU workloads, directly impacting AI model deployments... 

    Tensec

    Palo Alto, CA
    1 day ago
  • $152k - $241.5k

     ...We are seeking a Senior Software Engineer to drive integration of...  ...of leading open-source AI frameworks. In this...  ...including multi-node and multi-GPU environments. Improve...  ...have), and debugging clusters. Familiarity with...  ...ecosystem, or related ML infrastructure projects... 

    NVIDIA Gruppe

    Santa Clara, CA
    12 hours ago
  • $139k - $204k

     ...Senior Software Engineer, Cluster Orchestration CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers...  ...efficiently across massive GPU clusters. By building the systems...  ..., GPU-based applications, or ML pipelines. Knowledge of... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    3 days ago
  •  ...computing experiences-from AI and data centers, to...  ...latest hardware and software technology. THE PERSON...  ...from the lowest-level GPU kernels to large-scale...  ...deep passion for software engineering, strong technical...  ...the C++/HIP/CUDA core of ML frameworks like PyTorch... 

    Advanced Micro Devices, Inc.

    Santa Clara, CA
    5 days ago
  • $184k - $356.5k

     ...NVIDIA Corporation is seeking a Senior Software Engineer in Santa Clara to enhance the performance and reliability of large-scale AI infrastructures. The role involves leadership...  ...distributed training workloads across NVIDIA’s GPU platforms. Ideal candidates should have... 

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $152k - $241.5k

     ...NVIDIA seeks a senior software engineer to join the AI Networking co-design and benchmark R&D team. In...  ...tools. These include tools that use ML-based combinatorial optimization and...  ...optimize AI workloads across large GPU and CPU clusters, thereby ensuring the most... 

    NVIDIA

    Santa Clara, CA
    2 days ago
  • $152k - $241.5k

     ...NVIDIA is seeking a highly motivated Software Engineer to join our growing AI and Generative AI engineering team....  ...infrastructure for large‑scale ML training, inference, and generative...  ...cloud‑native platforms supporting GPU clusters, fault‑tolerant training, and high‑... 

    NVIDIA Gruppe

    Santa Clara, CA
    11 hours ago
  • $184k - $287.5k

     ...highly skilled and motivated software engineers to join us and build AI inference systems that...  ...inference stacks, optimize GPU kernels and compilers, drive...  ...deployments on GPU clusters across clouds. Conduct and...  ...frontier for the field of ML Systems; survey recent publications... 

    NVIDIA Gruppe

    Santa Clara, CA
    11 hours ago
  •  ...California, is seeking an experienced Platform Engineer to build and operate a hybrid infrastructure for advanced AI/ML research and product development. You will architect...  .... Candidates should have over 5 years of Software Engineering experience, proficiency in scripting... 

    Sanas

    Palo Alto, CA
    12 hours ago
  • $165k - $242k

     ...is The Essential Cloud for AI™. Built for pioneers by pioneers...  ...You'll Do: As a Senior Software Engineer II (IC4) on the AI...  ...Improve scheduling latency, cluster utilization, and workload reliability...  ...Familiarity with GPU-based workloads, ML training, or inference pipelines... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    18 days ago
  •  ...RoboForce RoboForce is an AI robotics company building Physical...  ...We are looking for a Senior Software Engineer to build scalable AI...  ...will work across cloud systems, GPU clusters, data pipelines, and robotics...  ...proficiency with C++, Python, and ML frameworks (e.g., PyTorch,... 
    Full time
    Work at office
    Visa sponsorship

    RoboForce

    Milpitas, CA
    1 day ago
  • $207k - $307k

    A leading tech company in Sunnyvale, CA is seeking a Software Engineer to lead the design and architecture of customer-facing systems. Applicants...  ...in software development, focusing on distributed systems and AI/ML applications. You will drive project outcomes, implement best... 
    Full time

    Google Inc.

    Sunnyvale, CA
    4 days ago
  •  ...General Motors is looking for a Senior ML Infrastructure Engineer to build robust compute platforms for AI validation. This role emphasizes driving efficiency and maximizing GPU utilization while improving platform reliability. You will collaborate with engineers to shape... 

    General Motors

    Sunnyvale, CA
    1 day ago
  • A leading technology company in California is looking for a Senior Software Engineer to develop cutting-edge AI and ML solutions. Responsibilities include writing and testing code, collaborating through design and code reviews, and contributing to documentation. Candidates... 
    Full time

    Google Inc.

    Sunnyvale, CA
    10 hours ago
  •  ...General Motors is hiring a Principal AI/ML Engineer to lead the AV ML Infra team in Mountain View, California. This leadership role involves shaping the vision and execution of AI and ML infrastructure, driving transformative projects, and mentoring engineers. Key qualifications... 
    Principal
    Local area
    Remote work

    General Motors

    Mountain View, CA
    11 hours ago
  •  ...A tech company in Palo Alto is actively seeking a Backend Engineer to develop systems that manage GPU clusters for AI workloads. The role demands 5+ years of backend experience, with strong skills in Go or Python. You will design APIs for GPU orchestration and manage resources... 

    SproutsAI

    Palo Alto, CA
    11 hours ago
