Senior ML Systems Engineer, Frameworks & Tooling

Cohere

Senior Engineer

Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what's best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future!

We're looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You'll Work On
  • Build and own the training framework responsible for large-scale LLM training.
  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
  • Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
  • Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training.
  • Investigate and resolve performance bottlenecks across the ML systems stack.
  • Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have
  • Strong engineering experience in large-scale distributed training or HPC systems.
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
  • Experience working with containerized environments (Docker, Singularity/Apptainer).
  • A track record of building tools that increase developer velocity for ML teams.
  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
  • Strong collaboration skills: you'll work closely with infra, research, and deployment teams.

Nice to Have
  • Experience with training LLMs or other large transformer architectures.
  • Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
  • Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
  • Experience with data pipeline optimization, sharded datasets, or caching strategies.
  • Background in performance engineering, profiling, or low-level systems.

Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us
  • You'll work on some of the most challenging and consequential ML systems problems today.
  • You'll collaborate with a world-class team working fast and at scale.
  • You'll have end-to-end ownership over critical components of the training stack.
  • You'll shape the next generation of infrastructure for frontier-scale models.
  • You'll build tools and systems that directly accelerate research and model quality.

Sample Projects:

  • Build a high-performance data loading and caching pipeline.
  • Implement performance profiling across the ML systems stack.
  • Develop internal metrics and monitoring for training runs.
  • Build reproducibility and regression testing infrastructure.
  • Develop a performant fault-tolerant distributed checkpointing system.

If some of the above doesn't line up perfectly with your experience, we still encourage you to apply!

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees at Cohere enjoy these perks:

  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Vacancy posted 2 days ago
