ML Systems Engineer: Distributed LLM Training & Inference
$200.8k - $251k
Scale AI
A leading AI technology company in San Francisco seeks a team member to build and optimize a machine learning framework for large language models. Candidates should have system optimization experience and solid software engineering skills, particularly with tools such as CUDA and PyTorch. This full-time position offers a competitive salary range of $200,800 - $251,000, along with comprehensive benefits.
$110 per hour
...Responsibilities: Guide research and engineering teams to close knowledge... ...in MLOps, training infrastructure, and ML framework-level topics.... ...to MLOps and ML systems problems. Evaluate... ...training pipeline design, distributed systems reasoning, and kernel... (Training, Remote job, Contract work, Summer work, Weekday work)
- ...to build and optimize their training and inference framework as part of their ML platform team. In this role... ...a strong background in system optimization and experience with distributed ML systems, along with excellent software engineering skills. Scale offers competitive... (Training)
$350k
...and steerable AI systems. We want AI to... ...committed researchers, engineers, policy experts,... ...Anthropic's inference fleet serves... ...model servers, distributed routing, autoscaling... ...Experience with ML systems: especially training or inference... ...infrastructure or general LLM serving stacks.... (Training, Visa sponsorship)
- ...Learning / Software Engineer Dyania Health... ...AI, an end-to-end system that combines a medically post-trained LLM with a physician-driven... .... As a senior ML engineer at Dyania... ..., deployment, and inference at scale.... ...GPU and multi-node distributed training and inference... (Training, Internship, Local area, Remote work, Flexible hours, Shift work)
- ...Generating high-quality training data for national LLMs Upskilling... ...supporting end-to-end system reliability, real-time inference observability, sovereign... ...: Partner with our Engineering and ML teams to ensure the... ...agentic development, and LLM observability tools. Ownership... (Training)
$264.8k - $331k
...state-of-the-art post-training algorithms to reach... ...The Enterprise ML Research Lab works... ...an ML Sys Research Engineer, you'll work on building... ...to optimize our ML system. Your customer will... ...our training and inference framework. Post-... ...least 1-3 years of LLM training in a... (Training, Full time)
- ...first and founding ML Operations Engineer at Tennr, you’ll play... ...Machine Learning and AI systems. You’ll own building machine learning training and inference pipelines that can... ...evaluation of ML & LLM systems. Candidate... ...alerting Experience with distributed systems, reliability... (Training, Work at office)
$250k - $350k
...state-of-the-art post‑training algorithms to reach... .... The Enterprise ML Research Lab works... ...an ML Sys Research Engineer, you’ll work on... ...to optimize our ML system. Your customer will... ...optimize our training and inference framework. Post‑... ...least 1‑3 years of LLM training in a... (Training, Full time)
- ...the Role Mirage is seeking an ML Engineer to push the boundaries of... ...building and extending agentic systems that understand and operate over... ...Develop novel approaches for training and adapting the large... ...understanding of transformers and modern LLM techniques Experience with... (Training, Full time, Local area, Night shift)
$200.2k - $357.5k
...Infrastructure Engineer to lead the design... ...our end-to-end ML platform... ..., and scale ML systems that improve real... ...platform spanning training,... ...batch and online inference, and edge deployment... ...EcoDriving insights, LLM-based reporting... ...experience with distributed computing frameworks... (Training, Full time, Work at office, Remote work, Flexible hours)
- A nonprofit AI research organization in New York City seeks a full-time ML Systems Engineer. This role involves managing distributed training infrastructure, debugging complex issues, and optimizing cloud resources to enhance operational efficiency. Ideal candidates will... (Training, Full time)
$227.2k - $417k
Software Engineer, ML Infra & Distributed Systems (Staff & Principal) About the Role: As a... ...world‑class machine learning inference platforms. These... ...that support Deep Learning, LLM, and Search models. This... ...Feast), ElastiCache, model training orchestration, etc. Understanding... (Training, Full time, Temporary work, Local area, Flexible hours)
- ...organization that puts human values first. About the Role: ML Systems Engineers at Basis ensure training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack from distributed training frameworks through cloud administration, making... (Training, Full time, Work at office)
- ...best AI detection systems. We publish... ...Machine Learning Engineers at all levels... ...generation, to training models, to deployment... .... At Pangram, ML engineers are... ...models Manage distributed infrastructure for multi-GPU LLM training... ...optimizing training and inference code Deploy... (Training, Work at office)
- ...world of intelligent systems. Location: New York,... ...deploy production‑grade ML systems with end‑to‑... ...drive innovation in LLM and audio ML... ...preprocessing, model training, deployment, inference, and monitoring in production... ...professional experience in ML engineering. Strong programming... (Training, Full time)
$85 per hour
...Larry Summers, and Jack Dorsey. Position: ML Engineer (Coding Agent Experience) Type: Contract Compensation... ...model-generated implementations involving model training, inference systems, MLOps, and LLM applications. Identify bugs, edge cases,... (Training, Remote job, Contract work, Summer work)
$200k
AI/ML & Algorithm Engineer Full Time Dedicated United States,... ...of multi-agent AI systems, novel scoring and... ...alignment. Build LLM‑powered workflows:... ...context assembly, and inference across disparate operational... ...that enable model training across distributed, domain‑owned data... (Training, Full time, Temporary work, Worldwide)
$112.7k - $169.1k
...the machine learning systems that decide which... ...world's leading game engine. Recommendation and... ...using causal inference, A/B testing, and offline... ...full pipeline from training data to deployed... ...reinforcement learning, LLM post-training or... ...-scale data and ML systems, whether through... (Training, Internship, Work at office, Worldwide, Relocation package, Shift work)
- ...Senior Machine Learning Engineer, you will take a... ...optimising end‑to‑end ML systems that operate at massive... ...and own large-scale, distributed machine learning systems for training, deployment, inference, and monitoring.... ...Augmented Generation), and LLM application... (Training)
$152k - $228k
...Job Description Senior ML Engineer About Invoca Invoca... ...lifecycle at Invoca, from model training and fine-tuning through inference optimization and... ...Design and Optimize SLM/LLM Deployment: Own the full... ...foundations to keep the systems powering our models reliable... (Training, Currently hiring, Remote work, Flexible hours)
$148.5k - $223.9k
...duplicating efforts. Job Category Software Engineering Job Details About Salesforce... ...the design and delivery of large-scale, distributed systems within Salesforce’s Security ecosystem... ...assignment, compensation, promotion, benefits, training, assessment of job performance,... (Training, Full time)
$135k - $170k
...and ongoing skills training opportunities Employee... ...The Senior MLOps Engineer treats ML systems as software systems... ...scoring and real‑time inference. The Senior Engineer... ...engineering‑first mindset: distributed systems,... ...Required) Integrate LLM APIs (Bedrock, Anthropic... (Training, For contractors, H-1B, Visa sponsorship, Flexible hours)
$250k - $350k
...Applied ML Systems Engineer - Finance - NEW YORK - UNITED STATES... ...be deep in GPU kernels trying to shave training time. Other weeks you'll be whiteboarding... ...Systems," "Training Infrastructure," "Distributed Training," "GPU Optimization," "Model... (Training, Permanent employment, Full time, Work experience placement, Internship, Immediate start, Remote work, Relocation, Relocation package)
- ...emergency response systems and remote patient... ...a principal-level engineer: shaping unclear... ...engineering, model training, model selection,... ...Develop practical ML models that balance... ...or near-real-time inference, model versioning,... ...solutions, including LLM-powered workflows,... (Training, Temporary work, Remote work)
- ...Senior Systems Engineer Sales New Jersey Full-time ID: VDT27560 Description VAST Data is... ...for real-time data analysis and AI training and inference. Designed from the ground up to make... ...matter expertise on storage products, distributed storage architectures, file systems,... (Training, Full time, Traineeship)
$180k - $250k
...looking for exceptional AI/ML Engineers to help shape and build it.... ...some of the world’s biggest distributed systems. A core part of this effort... ...parts of the system – from LLM‑powered classification and detection... .... Qualifications Experience training, deploying, and monitoring... (Training)
$160k - $200k
...everywhere in healthcare. Our LLM‑powered platform is solving... ...use cases. For health systems, our first product dramatically... ...We’re hiring an exceptional ML Engineer to join our team (Boston or... ...‑to‑end ML systems (design, training, inference, deployment, and monitoring;... (Training, Work at office)
- ...AI/ML Engineer 1 year + 4 days onsite... ...scalable data systems and good communication... ...downstream NLP models. LLM Integration • Incorporate... ...tools and scalable inference strategies. • Prior... ...collaborating on model training and evaluation, aligning... (Training, Local area)
$150k - $198k
...Machine Learning Engineer on the Trust &... ...detection systems that have a direct... ...end lifecycle of ML projects,... ...collection to training, deployment, monitoring... ...to scalable inference and data pipelines... ...of distributed computing for inference... ...for an agent or LLM vs. a classical... (Training, Work experience placement)
$135k - $170k
...and ongoing skills training opportunities Employee... ...The Senior MLOps Engineer treats ML systems as software systems... ...scoring and real‑time inference. They build the platform... ...) - Integrate LLM APIs into production... ...Kubernetes or ECS, CI/CD, distributed systems debugging,... (Training, For contractors, H-1B, Flexible hours)
- entry level machine learning engineer New York, NY
- machine learning ai engineer New York, NY
- junior machine learning research engineer New York, NY
- ai ml engineer New York, NY
- senior ml engineer New York, NY
- machine learning engineer New York, NY
- graduate machine learning engineer New York, NY
- data scientist machine learning engineer New York, NY
- computer vision machine learning engineer New York, NY
- machine learning software engineer New York, NY


