Member of Technical Staff - Distributed Training Engineer
Gravity Engineering Services Pvt Ltd.
About Liquid AI
Spun out of MIT CSAIL, we build general-purpose AI systems that run efficiently across deployment targets, from data center accelerators to on-device hardware, ensuring low latency, minimal memory usage, privacy, and reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.

The Opportunity
Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership training systems role focused on runtime, performance, and reliability (not a general platform/SRE role). You’ll work on a small team with fast feedback loops, building critical systems from the ground up rather than inheriting mature infrastructure.

What We're Looking For
- Loves distributed systems complexity: Our team builds systems that keep long training runs stable, debugs training failures across GPU clusters, and improves performance.
- Wants to build: We need builders who find satisfaction in robust, fast, reliable infrastructure.
- Thrives in ambiguity: Our systems support model architectures that are still evolving. We make decisions with incomplete information and iterate quickly.
- Aligns with team priorities and delivers: Our best engineers align with team priorities while pushing back with data when they see problems.

The Work
- Design and build core systems that make large training runs fast and reliable
- Build scalable distributed training infrastructure for GPU clusters
- Implement and tune parallelism/sharding strategies for evolving architectures
- Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
- Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
- Develop checkpointing mechanisms balancing memory constraints with recovery needs
- Create monitoring, profiling, and debugging tools for training stability and performance

Desired Experience
Must-have:
- Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
- Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
- Understanding of hardware accelerators and networking topologies
- Experience optimizing data pipelines for ML workloads

Nice-to-have:
- MoE (Mixture of Experts) training experience
- Large-scale distributed training (100+ GPUs)
- Open-source contributions to training infrastructure projects

What Success Looks Like (Year One)
- Training throughput has increased
- Overall training efficiency/cost has improved
- Training stability has improved (fewer failures, faster recovery)
- Data loading bottlenecks are eliminated for multimodal workloads

