Research Scientist / Engineer - Training Infrastructure

Full-time

Intellipro Group Inc

Job Description

Job Title: Research Scientist / Engineer – Training Infrastructure
Position Type: Full time
Location: Palo Alto, CA • Remote - US • Remote - International
Salary Range: $220,000 - $300,000 (USD)
Job ID#: 154559

Job Description:

We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. We are therefore training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change. We are looking for engineers with significant experience solving hard problems in PyTorch, CUDA, and distributed systems. You will work alongside the rest of the research team to build and train cutting-edge foundation models, designed from the ground up to scale, on thousands of GPUs.

Responsibilities:
  • Design, implement, and optimize efficient distributed training systems for models trained on thousands of GPUs
  • Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
  • Build monitoring, visualization, and debugging tools for large-scale training runs
  • Optimize training stability, convergence, and resource utilization across massive clusters
Requirements:
  • Extensive experience with distributed PyTorch training and parallelism techniques in foundation model training
  • Deep understanding of GPU clusters, networking, and storage systems
  • Familiarity with communication libraries (NCCL, MPI) and distributed system optimization
  • (Preferred) Strong Linux systems administration and scripting capabilities
  • (Preferred) Experience managing training runs across >100 GPUs
  • (Preferred) Experience with containerization, orchestration, and cloud infrastructure
About Us:
Founded in 2009, IntelliPro is a global leader in talent acquisition and HR solutions. Our commitment to delivering unparalleled service to clients, fostering employee growth, and building enduring partnerships sets us apart. We continue leading global talent solutions with a dynamic presence in over 160 countries, including the USA, China, Canada, Singapore, Japan, Philippines, UK, India, Netherlands, and the EU.
IntelliPro, a global leader connecting individuals with rewarding employment opportunities, is dedicated to understanding your career aspirations. As an Equal Opportunity Employer, IntelliPro values diversity and does not discriminate based on race, color, religion, sex, sexual orientation, gender identity, national origin, age, genetic information, disability, or any other legally protected group status. Moreover, our Inclusivity Commitment emphasizes embracing candidates of all abilities and ensures that our hiring and interview processes accommodate the needs of all applicants. Learn more about our commitment to diversity and inclusivity at

Compensation: The pay offered to a successful candidate will be determined by various factors, including education, work experience, location, job responsibilities, certifications, and more. Additionally, IntelliPro provides a comprehensive benefits package, all subject to eligibility.

Vacancy posted a month ago