Senior AI Infrastructure Engineer - Training Platform

$216k - $270k
Full-time

Scale AI

As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world’s most advanced models. The ideal candidate is a systems expert who thrives on solving the orchestration, networking, and reliability challenges that emerge at massive scale. You will partner closely with researchers to build a seamless, resilient environment that transforms raw compute into breakthrough AI.

YOU WILL:

  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs.
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
  • Work closely with Finance and Procurement teams to drive our capacity planning process.
  • Participate in our team’s on-call process to ensure the availability of our services.
  • Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment.

IDEALLY YOU'D HAVE:

  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
  • Experience with distributed training infrastructure, such as EFA, Infiniband, and topology-aware scheduling.
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput.
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
  • Proven ability to solve complex problems and work independently in fast-moving environments.

NICE TO HAVES:

  • Experience with distributed training techniques such as DeepSpeed, FSDP, etc.
  • Experience with the NVIDIA software and hardware stack (CUDA, NCCL)
  • Experience with PyTorch
  • Familiarity with post-training algorithms such as GRPO, and with Reinforcement Learning

Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position and may be inclusive of several career levels at Scale; it will be determined during the interview process based on work location and additional factors, including job-related skills, experience, qualifications, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity-based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant.

You'll also receive benefits including, but not limited to: comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend. Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the locations of San Francisco, New York, Seattle is:

$216,000—$270,000 USD

PLEASE NOTE: Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants.

About Us: At Scale, our mission is to develop reliable AI systems for the world's most important decisions. Our products provide the high-quality data and full-stack technologies that power the world's leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact. We work closely with industry leaders like Meta, Ernst & Young, Mayo Clinic, Time Inc., the Government of Qatar, and U.S. government agencies including the Army and Air Force. We are expanding our team to accelerate the development of AI applications.

We believe that everyone should be able to bring their whole selves to work, which is why we are proud to be an inclusive and equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability status, gender identity or Veteran status. We are committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities. If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us at View email address on click.appcast.io. Please see the United States Department of Labor's Know Your Rights poster for additional information. We comply with the United States Department of Labor's Pay Transparency provision.

PLEASE NOTE: We collect, retain and use personal data for our professional business purposes, including notifying you of job opportunities that may be of interest and sharing with our affiliates. We limit the personal data we collect to that which we believe is appropriate and necessary to manage applicants’ needs, provide our services, and comply with applicable laws. Any information we collect in connection with your application will be treated in accordance with our internal policies and programs designed to protect personal data. Please see our privacy policy for additional information.

Vacancy posted 5 hours ago