ML Infra Engineer: Scalable GPU Training

Garuda Ventures

Garuda Ventures is seeking a Machine Learning Infrastructure Engineer in Palo Alto to develop systems for large-scale model training. The ideal candidate will excel in distributed training, managing core ML infrastructure, and rapid iteration across numerous GPUs. The role requires familiarity with PyTorch or JAX and focuses on improving performance and reliability while partnering closely with researchers to enable efficient model training and deployment.

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the ML Infra Engineer: Scalable GPU Training in Palo Alto, CA vacancy
  • $272k - $431.25k

     ...Principal AI and ML Infra Software Engineer, GPU Clusters We are seeking a Principal AI and ML Infra...  ...we can craft potent, effective, and scalable solutions as we mold the future of AI...  ...and improving substantial distributed training operations using PyTorch (DDP, FSDP)... 
    Training

    NVIDIA

    Santa Clara, CA
    3 days ago
  • $170k - $216k

     ...Machine Learning Engineer (Infra), Driver Understanding and Evaluation...  ...team will build and operate scalable machine learning and data systems...  ...machine learning models to deliver training and evaluation data for...  ...distributed systems covering the ML lifecycle, supporting planet-... 
    Training
    Full time

    Waymo

    Mountain View, CA
    3 days ago
  •  ...environments and handling scenarios unseen in training. We work at the intersection of...  ...reality. We're looking for a Senior ML & Data Infrastructure Engineer to own and scale the systems that...  ...at scale Support integration and scalable deployment of vision-language models... 
    Training
    Immediate start

    Rhoda AI

    Palo Alto, CA
    2 days ago
  • $275.8k - $340.5k

     ...About the team: The AV ML Infra team at GM builds ML infrastructure...  ...Science, and more. We enable scalable and efficient ML...  ...enhance the productivity of ML engineers, and drive the adoption of cutting...  ...Streamlines and optimizes large-scale ML training and inference across cloud... 
    Training
    Local area
    Remote work
    Work from home
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    2 days ago
  • $272k - $431.25k

    NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California, to enhance the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with various teams, monitoring infrastructure performance, and implementing improvements... 
    Suggested

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In this role,...  ...enhance efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • $152k - $287.5k

    NVIDIA Gruppe, based in Santa Clara, is seeking a Senior Software Engineer to accelerate the development of machine learning innovations. In this role, you'll design and implement solutions for GPU clusters, enabling researchers to optimize their work. Strong expertise... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • $153k - $222k

    Decisive Point is looking for infrastructure engineers and ML engineers to join the Data & ML infra group in Mountain View, California. The role focuses on working across the ML lifecycle and solving broad data problems. Ideal candidates will have software engineering... 
    Training

    Decisive Point

    Mountain View, CA
    4 days ago
  • $296.3k

     ...About the team: The AV ML Infra team at GM builds ML infrastructure...  ...Science, and more. We enable scalable and efficient ML...  ...enhance the productivity of ML engineers, and drive the adoption of cutting...  ...Streamlines and optimizes large-scale ML training and inference across cloud... 
    Training
    Local area
    Work from home
    Flexible hours

    General Motors

    Sunnyvale, CA
    4 days ago
  • General Motors in Sunnyvale, California, is offering a Staff ML Infra Engineer position that focuses on enhancing autonomous driving...  ...learning solutions. The role involves designing scalable systems for training and evaluating ML models, requiring a strong background... 
    Training
    Remote work

    General Motors

    Sunnyvale, CA
    4 days ago
  • A technology company in Palo Alto is seeking a Senior ML & Data Infrastructure Engineer to own and scale its data infrastructure. The role involves architecting a high-throughput system for managing billions of clips, optimizing storage solutions, and collaborating directly... 

    Rhoda AI

    Palo Alto, CA
    2 days ago
  •  ...handling scenarios unseen in training. We work at the...  ...reality. We're looking for an ML Infrastructure Engineer to help build and operate the...  ...Contribute to the reliability and scalability of the inference stack as...  ...(e.g., AWS, GCP) and GPU orchestration Familiarity... 
    Training

    Rhoda AI

    Palo Alto, CA
    16 hours ago
  • $153.2k - $234.1k

     ...world scenarios. As a Senior ML Infra Engineer, you will work on the core...  ...rapid dataset generation, training, evaluation and iteration of...  ...ML models that rely on the scalable, intuitive, and high-performance...  ...ML training across large GPU/CPU clusters or specialized... 
    Training
    Local area
    Remote work
    Work from home
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    3 days ago
  • Cognita Imaging Inc. is seeking a Member of Technical Staff for the ML Infrastructure team in Palo Alto, California. This role involves building and managing the infrastructure for machine learning systems, focusing on distributed systems and model serving. Candidates... 
    Training

    Cognita Imaging Inc.

    Palo Alto, CA
    16 hours ago
  •  ...-edge robotics company based in California is looking for an experienced Machine Learning Infrastructure Engineer. This role involves designing scalable ML training platforms, optimizing high-performance computing systems, and ensuring robust job scheduling and reliability... 
    Training

    Dyna Robotics

    Redwood City, CA
    2 days ago
  • $170k - $360k

     ...Software Engineer - Data Infra Reliability As our models scale to "omni...  ...resilience, automation, and scalability of the petabyte-scale...  ...allow our researchers to train on massive datasets without...  ...Experience managing GPU clusters or AI/ML workloads. Background in... 
    Training

    Luma AI

    Palo Alto, CA
    5 days ago
  •  ...handling scenarios unseen in training. We work at the...  ...hiring a Staff/Principal ML Systems Engineer to own training...  ...identification at different GPU counts Drive...  ...translate model changes into scalable implementations Provide...  ...(as part of the infra team) Partner with infra... 
    Training

    Rhoda AI

    Palo Alto, CA
    2 days ago
  • $189.3k - $290.7k

     ...-world scenarios. As a Staff ML Infra Engineer, you will drive the development...  ...rapid dataset generation, training, evaluation, and iteration...  ...ML models that rely on the scalable, intuitive, and high‑performance...  ...ML training across large GPU/CPU clusters or other accelerators Familiarity... 
    Training
    Local area
    Remote work
    Work from home
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    4 days ago
  • $150k - $250k

     ...Machine Learning Engineer Goldman Sachs is a leading...  ...Engineering focused on running scalable production management...  ...PRX applies advanced ML and GenAI to reduce the...  ...(SageMaker), and infra-as-code (Terraform/CloudFormation...  ...access to excellent training programs designed to... 
    Training
    Full time
    Temporary work
    Part time
    Worldwide

    The Goldman Sachs Group, Inc.

    Menlo Park, CA
    4 days ago
  • $129k - $198.4k

     ...Description Role: As an AI/ML Engineer on the Metrics Frameworks...  ...monitoring, data mining and training, and simulation metrics. About...  ...safe, high-performing, and scalable driverless technology. The team...  ...other frameworks and data infra teams to build and deploy tools... 
    Training
    Local area
    Work from home

    General Motors

    Mountain View, CA
    4 days ago
  • $166k - $244k

    A leading tech company based in Mountain View is looking for a Senior Software Engineer for Machine Learning in the Core ML division. The role involves programming in Python or C++, managing ML infrastructure, and architecting model transitions. Preferred candidates will... 
    Training

    Google Inc.

    Mountain View, CA
    3 days ago
  •  ...As a Machine Learning Engineer in AI Core, Data Intelligence...  ...— from building scalable data infrastructure and...  ...foundation models to designing, training, and shipping...  ...multiple stages of the ML lifecycle: ingesting and...  ...Core ML, Platform, and Infra teams to ensure seamless... 
    Training
    Work from home
    Relocation package
    Flexible hours

    Nubank

    Palo Alto, CA
    3 days ago
  •  ...teams to close the gap between training and real-world deployment....  ...Collaborate closely with research engineers to translate model innovations...  ...in inference optimization, ML systems, or a closely related...  ...Have (But Not Required) GPU kernel or compiler-level experience... 
    Training

    Rhoda AI

    Palo Alto, CA
    5 days ago
  • $275.8k - $340.5k

    About the Team The AV ML Infra team at GM builds ML infrastructure designed to meet the unique demands...  ..., Data Science, and more. We enable scalable and efficient ML experimentation, enhance the productivity of ML engineers, and drive the adoption of cutting‑edge ML techniques... 
    Remote work
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Mountain View, CA
    1 day ago
  • $120k - $140k

     ...Deep Learning / NLP Engineer Join Avoma and work on some of the most challenging NLP problems in our mission...  ...similar LLMs. Experience writing and shipping scalable, high-performant and clean code Experience training Deep Neural Networks Experience with... 
    Training

    Avoma Inc

    Palo Alto, CA
    5 days ago
  •  ...As a Machine Learning Engineer, you will play a central...  ...learning research into scalable, production-ready solutions...  ...opportunities where ML can drive product value...  ...data pipelines, model training, debugging, and performance...  ...with multi-node GPU training. Contributor... 
    Training

    Nace AI

    Palo Alto, CA
    1 day ago
  • $213k - $263k

     ...Machine Learning Engineer, Simulation Realism Waymo...  ...environments for testing and training the Waymo Driver. Our...  ..., machine learning (ML) engineers, and data scientists...  ...transfer research into scalable and production-ready...  ...large scale models on GPU/TPU clusters are... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    3 days ago
  • $230k - $265k

     ...AI team responsible for ML and work alongside...  ...veteran scientists and engineers. As a Senior Machine Learning...  ...and implementation of training, fine-tuning, post-...  ...cutting-edge research into scalable, production-grade...  ...end-to-end with product, infra, research, and data teams... 
    Training
    Permanent employment

    Otter.ai

    Mountain View, CA
    1 day ago
  • $158k - $241.9k

     ...global scale. Role: As a Senior AI/ML Engineer within the Onboard Embodied AI...  ...approaches with sophisticated neural networks trained from large-scale driving data and using...  ...to onboard implementation, emphasizing scalability, robustness, and safety-critical operation... 
    Training
    Local area
    Work from home
    Relocation package
    Flexible hours

    General Motors

    Mountain View, CA
    2 days ago
  •  ...ML Engineer Palo Alto, California, United States About the Job Our client is a rapidly...  ...data acquisition, preprocessing, model training, deployment, inference, and monitoring...  ...ML infrastructure and processes for scalability and performance. Qualifications... 
    Training
    Full time

    Catalyst Labs, LLC

    Palo Alto, CA
    3 days ago
