Principal AI and ML Infra Software Engineer, GPU Clusters

$272k - $431.25k

Dormont Manufacturing Co

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. In this role, you will be pivotal in making our researchers more efficient by implementing improvements across the entire stack. Your main task will be to collaborate closely with customers to pinpoint and address infrastructure deficiencies, enabling groundbreaking AI and ML research on GPU clusters. Together, we can build powerful, efficient, and scalable solutions as we shape the future of AI/ML technology!

What you will be doing:

  • Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.
  • Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically resolve them. Drive the direction and long-term roadmaps for such initiatives.
  • Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
  • Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
  • Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
  • Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.

What we need to see:

  • BS or similar background in Computer Science or related area (or equivalent experience).
  • 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
  • Hands‑on experience using or operating High Performance Computing (HPC) grade infrastructure, as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (e.g., Docker, Enroot).
  • Ability to supervise and improve large-scale distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX, along with an in-depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines.
  • Proficiency in programming and scripting languages such as Python, Go, and Bash; familiarity with cloud computing platforms (e.g., AWS, GCP, Azure); and experience with parallel computing frameworks and paradigms.
  • Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector.
  • Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

NVIDIA offers competitive salaries and a comprehensive benefits package. Our engineering teams are growing rapidly. If you’re a passionate and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 10 hours ago