ML Infra Engineer: Scalable GPU Training
Garuda Ventures
Garuda Ventures is seeking a Machine Learning Infrastructure Engineer in Palo Alto to develop systems for large-scale model training. The ideal candidate will excel in distributed training, managing core ML infrastructure, and rapid iteration across numerous GPUs. This role demands familiarity with PyTorch or JAX, focusing on enhancing performance and reliability while partnering closely with researchers to facilitate efficient model training and deployment.
$272k - $431.25k
...Principal AI and ML Infra Software Engineer, GPU Clusters. We are seeking a Principal AI and ML Infra... ...we can craft potent, effective, and scalable solutions as we mold the future of AI... ...and improving substantial distributed training operations using PyTorch (DDP, FSDP)... Training
$170k - $216k
...Machine Learning Engineer (Infra), Driver Understanding and Evaluation... ...team will build and operate scalable machine learning and data systems... ...machine learning models to deliver training and evaluation data for... ...distributed systems covering the ML lifecycle, supporting planet-... Training, Full time
- ...environments and handling scenarios unseen in training. We work at the intersection of... ...reality. We're looking for a Senior ML & Data Infrastructure Engineer to own and scale the systems that... ...at scale Support integration and scalable deployment of vision-language models... Training, Immediate start
$275.8k - $340.5k
...About the team: The AV ML Infra team at GM builds ML infrastructure... ...Science, and more. We enable scalable and efficient ML... ...enhance the productivity of ML engineers, and drive the adoption of cutting... ...Streamlines and optimizes large-scale ML training and inference across cloud... Training, Local area, Remote work, Work from home, Relocation, Relocation package, Flexible hours
$272k - $431.25k
NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California, to enhance the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with various teams, monitoring infrastructure performance, and implementing improvements... Suggested
- NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In this role,... ...enhance efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The...
$152k - $287.5k
NVIDIA Gruppe, based in Santa Clara, is seeking a Senior Software Engineer to accelerate the development of machine learning innovations. In this role, you'll design and implement solutions for GPU clusters, enabling researchers to optimize their work. Strong expertise...
$153k - $222k
Decisive Point is looking for infrastructure engineers and ML engineers to join the Data & ML infra group in Mountain View, California. The role focuses on working across the ML lifecycle and solving broad data problems. Ideal candidates will have software engineering... Training
$296.3k
...About the team: The AV ML Infra team at GM builds ML infrastructure... ...Science, and more. We enable scalable and efficient ML... ...enhance the productivity of ML engineers, and drive the adoption of cutting... ...Streamlines and optimizes large-scale ML training and inference across cloud... Training, Local area, Work from home, Flexible hours
- General Motors in Sunnyvale, California, is offering a Staff ML Infra Engineer position that focuses on enhancing autonomous driving... ...learning solutions. The role involves designing scalable systems for training and evaluating ML models, requiring a strong background... Training, Remote work
- A technology company in Palo Alto is seeking a Senior ML & Data Infrastructure Engineer to own and scale its data infrastructure. The role involves architecting a high-throughput system for managing billions of clips, optimizing storage solutions, and collaborating directly...
- ...handling scenarios unseen in training. We work at the... ...reality. We're looking for an ML Infrastructure Engineer to help build and operate the... ...Contribute to the reliability and scalability of the inference stack as... ...(e.g., AWS, GCP) and GPU orchestration Familiarity... Training
$153.2k - $234.1k
...world scenarios. As a Senior ML Infra Engineer, you will work on the core... ...rapid dataset generation, training, evaluation and iteration of... ...ML models that rely on the scalable, intuitive, and high-performance... ...ML training across large GPU/CPU clusters or specialized... Training, Local area, Remote work, Work from home, Relocation package, Flexible hours
- Cognita Imaging Inc. is seeking a Member of Technical Staff for the ML Infrastructure team in Palo Alto, California. This role involves building and managing the infrastructure for machine learning systems, focusing on distributed systems and model serving. Candidates... Training
- ...-edge robotics company based in California is looking for an experienced Machine Learning Infrastructure Engineer. This role involves designing scalable ML training platforms, optimizing high-performance computing systems, and ensuring robust job scheduling and reliability... Training
$170k - $360k
...Software Engineer - Data Infra Reliability As our models scale to "omni... ...resilience, automation, and scalability of the petabyte-scale... ...allow our researchers to train on massive datasets without... ...Experience managing GPU clusters or AI/ML workloads. Background in... Training
- ...handling scenarios unseen in training. We work at the... ...hiring a Staff/Principal ML Systems Engineer to own training... ...identification at different GPU counts Drive... ...translate model changes into scalable implementations Provide... ...(as part of the infra team) Partner with infra... Training
$189.3k - $290.7k
...-world scenarios. As a Staff ML Infra Engineer, you will drive the development... ...rapid dataset generation, training, evaluation, and iteration... ...ML models that rely on the scalable, intuitive, and high‑performance... ...ML training across large GPU/CPU clusters or other accelerators Familiarity... Training, Local area, Remote work, Work from home, Relocation, Relocation package, Flexible hours
$150k - $250k
...Machine Learning Engineer Goldman Sachs is a leading... ...Engineering focused on running scalable production management... ...PRX applies advanced ML and GenAI to reduce the... ...(SageMaker), and infra-as-code (Terraform/CloudFormation... ...access to excellent training programs designed to... Training, Full time, Temporary work, Part time, Worldwide
$129k - $198.4k
...Description Role: As an AI/ML Engineer on the Metrics Frameworks... ...monitoring, data mining and training, and simulation metrics. About... ...safe, high-performing, and scalable driverless technology. The team... ...other frameworks and data infra teams to build and deploy tools... Training, Local area, Work from home
$166k - $244k
A leading tech company based in Mountain View is looking for a Senior Software Engineer for Machine Learning in the Core ML division. The role involves programming in Python or C++, managing ML infrastructure, and architecting model transitions. Preferred candidates will... Training
- ...As a Machine Learning Engineer in AI Core, Data Intelligence... ...— from building scalable data infrastructure and... ...foundation models to designing, training, and shipping... ...multiple stages of the ML lifecycle: ingesting and... ...Core ML, Platform, and Infra teams to ensure seamless... Training, Work from home, Relocation package, Flexible hours
- ...teams to close the gap between training and real-world deployment.... ...Collaborate closely with research engineers to translate model innovations... ...in inference optimization, ML systems, or a closely related... ...Have (But Not Required) GPU kernel or compiler-level experience... Training
$275.8k - $340.5k
About the Team The AV ML Infra team at GM builds ML infrastructure designed to meet the unique demands... ..., Data Science, and more. We enable scalable and efficient ML experimentation, enhance the productivity of ML engineers, and drive the adoption of cutting‑edge ML techniques... Remote work, Relocation, Relocation package, Flexible hours
$120k - $140k
...Deep Learning / NLP Engineer Join Avoma and work on some of the most challenging NLP problems in our mission... ...similar LLMs. Experience writing and shipping scalable, high-performant and clean code Experience training Deep Neural Networks Experience with... Training
- ...As a Machine Learning Engineer, you will play a central... ...learning research into scalable, production-ready solutions... ...opportunities where ML can drive product value... ...data pipelines, model training, debugging, and performance... ...with multi-node GPU training. Contributor... Training
$213k - $263k
...Machine Learning Engineer, Simulation Realism Waymo... ...environments for testing and training the Waymo Driver. Our... ..., machine learning (ML) engineers, and data scientists... ...transfer research into scalable and production-ready... ...large scale models on GPU/TPU clusters are... Training, Full time, Remote work
$230k - $265k
...AI team responsible for ML and work alongside... ...veteran scientists and engineers. As a Senior Machine Learning... ...and implementation of training, fine-tuning, post-... ...cutting-edge research into scalable, production-grade... ...end-to-end with product, infra, research, and data teams... Training, Permanent employment
$158k - $241.9k
...global scale. Role: As a Senior AI/ML Engineer within the Onboard Embodied AI... ...approaches with sophisticated neural networks trained from large-scale driving data and using... ...to onboard implementation, emphasizing scalability, robustness, and safety-critical operation... Training, Local area, Work from home, Relocation package, Flexible hours
- ...ML Engineer Palo Alto, California, United States About the Job Our client is a rapidly... ...data acquisition, preprocessing, model training, deployment, inference, and monitoring... ...ML infrastructure and processes for scalability and performance. Qualifications... Training, Full time
- senior ml engineer Palo Alto, CA
- machine learning ai engineer Palo Alto, CA
- computer vision machine learning engineer Palo Alto, CA
- machine learning software engineer Palo Alto, CA
- ai ml engineer Palo Alto, CA
- machine learning engineer Palo Alto, CA
- machine learning remote Palo Alto, CA
- machine learning scientist Palo Alto, CA
- machine learning intern Palo Alto, CA
- data engineer machine learning Palo Alto, CA

