Member of Technical Staff - Distributed Training Engineer

Gravity Engineering Services Pvt Ltd.

About Liquid AI

Spun out of MIT CSAIL, we build general-purpose AI systems that run efficiently across deployment targets, from data center accelerators to on-device hardware, ensuring low latency, minimal memory usage, privacy, and reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.

The Opportunity

Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership training systems role focused on runtime, performance, and reliability (not a general platform/SRE role). You’ll work on a small team with fast feedback loops, building critical systems from the ground up rather than inheriting mature infrastructure.

What We're Looking For
  • Loves distributed systems complexity: Our team builds systems that keep long training runs stable, debugs training failures across GPU clusters, and improves performance.
  • Wants to build: We need builders who find satisfaction in robust, fast, reliable infrastructure.
  • Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate quickly.
  • Aligns with team priorities and delivers: Our best engineers align with team priorities while pushing back with data when they see problems.

The Work
  • Design and build core systems that make large training runs fast and reliable
  • Build scalable distributed training infrastructure for GPU clusters
  • Implement and tune parallelism/sharding strategies for evolving architectures
  • Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
  • Develop checkpointing mechanisms balancing memory constraints with recovery needs
  • Create monitoring, profiling, and debugging tools for training stability and performance

Desired Experience

Must-have:
  • Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
  • Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
  • Understanding of hardware accelerators and networking topologies
  • Experience optimizing data pipelines for ML workloads

Nice-to-have:
  • MoE (Mixture of Experts) training experience
  • Large-scale distributed training (100+ GPUs)
  • Open-source contributions to training infrastructure projects

What Success Looks Like (Year One)
  • Training throughput has increased
  • Overall training efficiency/cost has improved
  • Training stability has improved (fewer failures, faster recovery)
  • Data loading bottlenecks are eliminated for multimodal workloads

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Member of Technical Staff - Distributed Training Engineer in San Francisco, CA vacancy
  •  ...time Location Type On-site Department Engineering Our Mission Reflection’s mission is to...  ...shared services that power our research, training, and production environments. These systems...  ...environments, multi-tenant isolation. Distributed Systems Architecture: Sharding,... 
    Training
    Full time
    Relocation package

    B Capital

    San Francisco, CA
    3 days ago
  • $60k - $240k

     ...Member of Technical Staff - Data Quality Engineer (Pre-training) Join to apply for the Member of Technical Staff - Data Quality Engineer (Pre-training) role at Reflection AI Our Mission Reflection’s mission is to build open superintelligence and make it accessible to all... 
    Training
    Relocation package

    Reflection AI

    San Francisco, CA
    3 days ago
  • $150k

     ...real‑world datasets to train and deploy state‑of‑...  ...implement, and optimize distributed training systems that...  ...tools. Participate in technical discussions about new...  ...tech lead, or leading an engineering team. Expertise in...  ...employees, supervisors, and staff; adhere to standards... 
    Training
    Internship
    Local area

    Amazon Science

    San Francisco, CA
    4 days ago
  • $150k - $300k

     ...enables anyone to create, train, and deploy them. We...  ...training stack. Core Technical Responsibilities LLM...  ...throughput/latency SLOs. Model Distribution: Optimize model...  ...: LLM Inference engine development and integration...  ...development and encourage team members to contribute to the... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours
    Shift work

    Prime Intellect

    San Francisco, CA
    3 days ago
  • $150k - $300k

     ...enables anyone to create, train, and deploy them. We...  ...management You will work on a distributed system with performance engineering at its core. The role...  ...reliable at scale. Core Technical Responsibilities...  ...development and encourage team members to contribute to the broader... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime Intellect

    San Francisco, CA
    4 days ago
  •  ...foundational infrastructure to train specialized AI agents. We...  ...one seamless system. As a Member of Technical Staff, Infrastructure / DevOps,...  ...stack. Partner with research engineers to turn experimental...  ...experience building or operating distributed systems, cloud infrastructure... 
    Training

    Plato.ai

    San Francisco, CA
    4 days ago
  • $300k

     ...Member of Technical Staff - Mechanistic Interpretability About V max V max is...  ...interpretability to extract useful training signals from the internal...  ...RL infrastructure, distributed training, experiment tracking...  .... Demonstrated software engineering ability, especially in... 
    Training
    Work at office
    Local area

    VMAX LLC

    San Francisco, CA
    4 days ago
  •  ...Member of Technical Staff, Applied Research About Us At Fireworks, we’re building...  ...intersection of ML research, systems engineering, and customer‑facing problem...  ..., algorithms, concurrency, distributed systems, networking. Hands‑on experience training, fine‑tuning, or evaluating... 
    Training

    SupportFinity

    San Francisco, CA
    4 days ago
  •  ...Pixeltable Inc. Member of Technical Staff San Francisco, CA·Full time Apply for...  ...a founding member of the engineering team, you will impact the...  ...ingestion, transformation, training/fine-tuning, and inference...  ...experience in an industry setting: distributed data systems, cloud... 
    Training
    Full time
    Part time
    Work at office
    Work from home
    Flexible hours
    2 days per week

    Pixeltable, Inc.

    San Francisco, CA
    4 days ago
  •  ...Member of Technical Staff - Post‑Training Join to apply for the Member of Technical Staff - Post‑Training role...  ...general agents. Drive research and engineering initiatives that push the frontier...  ...into complex ML codebases and distributed systems. Experience improving model... 
    Training
    Full time
    Relocation package

    Reflection AI

    San Francisco, CA
    4 days ago
  •  ...to serve humanity. We’re training and deploying frontier models...  ...is a team of researchers, engineers, designers, and more, who...  ...and shape the future! Member of Technical Staff, Search Why this role? We...  ...Experience using large-scale distributed training strategies with GPUs... 
    Training
    Full time
    Work at office
    Remote work
    Flexible hours

    Cohere

    San Francisco, CA
    3 days ago
  •  ...to serve humanity. We’re training and deploying frontier models...  ...is a team of researchers, engineers, designers, and more, who...  ...remote-friendly! As a Member of Technical Staff, you will: Design and write...  ...XLA/MLIR. Experience with distributed training infrastructures (... 
    Training
    Full time
    Work at office
    Remote work
    Flexible hours

    Jaide Health

    San Francisco, CA
    4 days ago
  •  ...is bringing the rigor of distributed systems, model architecture...  ...redesigned the foundation model training stack to turn the world’s...  ...the barrier to creating engineered threats and AI-generated...  .... About the Role As a Member of Technical Staff, Mechanistic Interpretability... 
    Training
    Local area

    Radical Numerics Inc.

    San Francisco, CA
    12 hours ago
  •  ...is bringing the rigor of distributed systems, model architecture...  ...redesigned the foundation model training stack to turn the world’s...  ...the barrier to creating engineered threats and AI-generated...  .... About the Role As a Member of Technical Staff, Pre-Training Science at Radical... 
    Training
    Local area

    Radical Numerics Inc.

    San Francisco, CA
    13 hours ago
  •  ...robotic intelligence. As a Member of Technical Staff, you'll be at the forefront...  ...working closely with robotics engineers to integrate your solutions...  ...real‑world datasets to train and deploy state‑of‑the‑art...  ...learning Familiarity with distributed training systems Track record... 
    Training
    Local area

    Amazon Science

    San Francisco, CA
    23 hours ago
  •  ...Member of Technical Staff, Document Understanding Join us and help shape the future...  ...are seeking exceptional AI engineers to join our core document...  ...Responsibilities: Develop, train, and optimize machine...  ...with Docker/Kubernetes and distributed systems Active participation... 
    Training
    Work at office
    Remote work

    LlamaIndex, Inc.

    San Francisco, CA
    3 days ago
  • $150k - $300k

     ...that lets anyone create, train, and deploy them. We...  ...runs the jobs. Core Technical Responsibilities Hosted...  ...Requirements We're looking for engineers who are fluent across...  ...Understanding of distributed training fundamentals...  ...and encourage team members to contribute to the broader... 
    Training
    Work at office
    Local area
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime-Intellect

    San Francisco, CA
    4 days ago
  • $300k

     ...Member of Technical Staff - RL Algorithms About V max V max is an applied research...  ...de-facto method of post-training LLMs. We are limited by...  ...exploration failures, and distribution shift. Collaborate with researchers...  .... Demonstrated software engineering ability. Strong... 
    Training
    Work at office
    Local area
    Shift work

    VMAX LLC

    San Francisco, CA
    4 days ago
  • $300k

     ...Member of Technical Staff - RL Infrastructure About V max V max is an applied research lab developing...  ...This role is for strong infrastructure engineers who can build the systems layer for RL at scale: distributed rollouts, training orchestration, inference, evals, data... 
    Training
    Work at office
    Local area

    VMAX LLC

    San Francisco, CA
    4 days ago
  • $150k - $350k

     ...Mission Gimlet Labs is seeking a Member of Technical Staff focused on distributed systems. In this role, you will build the core platform that schedules...  ...and failure conditions. This role is well‑suited for engineers who enjoy building foundational infrastructure, understanding... 

    Gimlet Labs, Inc.

    San Francisco, CA
    4 days ago
  •  ...is bringing the rigor of distributed systems, model architecture...  ...redesigned the foundation model training stack to turn the world’s...  ...the barrier to creating engineered threats and AI-generated...  .... About the Role As a Member of Technical Staff, Infrastructure & Training... 
    Training
    Local area

    Radical Numerics Inc.

    San Francisco, CA
    13 hours ago
  • $150k

     ...rich real-world datasets to train and deploy state-of-the-art foundation...  ...existing team of platform engineers to extend the systems that...  ...modern cloud-native stack — distributed compute on Kubernetes,...  ...employees, supervisors, and staff; adhere to standards of excellence... 
    Training
    Internship
    Local area

    Amazon Technologies, Inc.

    San Francisco, CA
    3 days ago
  •  ...more than headcount. The engineers we hire today will...  ...learning role. We are not training models or tuning...  ...kernel, networking, and distributed systems engineers to...  ...come. As an early member of our team, you will...  ...work alongside highly technical engineers, and help shape... 
    Training

    Gimlet Labs

    San Francisco, CA
    2 days ago
  •  ...Member Of Technical Staff We're looking for a member of technical staff to...  ...Design scalable pipelines for training, inference, and data...  ...Master's in computer science, engineering, or related field...  ...systems Experience with distributed systems or large-scale infrastructure... 
    Training

    ERAGON

    San Francisco, CA
    3 days ago
  • $180k

     ...Member Of Technical Staff - Pre-Training Palo Alto, CA About XAI XAI's mission is to create AI systems...  ...highly motivated, and focused on engineering excellence. This organization is...  ...of scaling laws. Familiar with distributed training, multi-GPU neural network... 
    Training
    Temporary work

    Xai

    San Francisco, CA
    23 hours ago
  •  ...Member Of Technical Staff - Image / Video Generation Freiburg (Germany) About Black Forest Labs...  .... Why This Role You'll train large-scale diffusion models for image...  ...ablation. You're comfortable debugging distributed training issues and presenting research... 
    Training
    Remote work
    Worldwide
    2 days per week

    Black Forest Labs

    San Francisco, CA
    23 hours ago
  • $200k - $350k

     ...both clients and candidates. Member of Technical Staff - Pre-Training Infrastructure Location: San...  ...team sits at the intersection of distributed systems, machine learning infrastructure...  ...environments. ~ Strong systems engineering skills spanning machine learning... 
    Training
    Work at office
    Visa sponsorship

    Recruiting from Scratch

    San Francisco, CA
    23 hours ago
  • $256k - $276k

     ...The Opportunity As a Member of Technical Staff on AI Infrastructure, you will...  ...the foundational systems and distributed infrastructure that power AI model post training, inference, and data...  .... You will collaborate with engineering and research teams to ensure... 
    Training
    Work at office
    Flexible hours
    3 days per week

    Postman

    San Francisco, CA
    2 days ago
  • $200k - $350k

     ...Member Of Technical Staff, Training Infra Bay Area Ai Systems Inception creates the world's fastest, most...  .... We are the AI researchers and engineers behind such breakthrough AI technologies...  ...Design, implement, and optimize distributed training systems that scale across... 
    Training
    Immediate start
    Flexible hours

    Inception LLC

    San Francisco, CA
    1 day ago
  •  ...Member of Technical Staff – Agents at Prime Intellect – San Francisco Building...  ...compute, code or capital to train powerful, open models. Our...  ...systems that support distributed AI agent execution at scale...  ...product, research, and other engineering teams to identify key... 
    Training
    Remote work
    Flexible hours

    Victrays

    San Francisco, CA
    3 days ago
