Senior AI Infrastructure Engineer - Training Platform
$216k - $270k
Scale AI
As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world’s most advanced models. The ideal candidate is a systems expert who thrives on solving the orchestration, networking, and reliability challenges that emerge at massive scale. You will partner closely with researchers to build a seamless, resilient environment that transforms raw compute into breakthrough AI.
YOU WILL:
* Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
* Design and implement scheduling primitives to optimize the lifecycle of training jobs.
* Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
* Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
* Work closely with Finance and Procurement teams to drive our capacity planning process.
* Participate in our team’s on-call process to ensure the availability of our services.
* Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.
IDEALLY YOU'D HAVE:
* 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
* Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).
* Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
* Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling.
* Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput.
* Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
* Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
* Proven ability to solve complex problems and work independently in fast-moving environments.
NICE TO HAVES:
- Experience with distributed training techniques such as DeepSpeed, FSDP, etc.
- Experience with the NVIDIA software and hardware stack (CUDA, NCCL)
- Experience with PyTorch
- Familiarity with post-training algorithms such as GRPO, and with
$216,000—$270,000 USD
PLEASE NOTE: Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants.
About Us: At Scale, our mission is to develop reliable AI systems for the world's most important decisions. Our products provide the high-quality data and full-stack technologies that power the world's leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact. We work closely with industry leaders like Meta, Ernst & Young, Mayo Clinic, Time Inc., the Government of Qatar, and U.S. government agencies including the Army and Air Force. We are expanding our team to accelerate the development of AI applications.
We believe that everyone should be able to bring their whole selves to work, which is why we are proud to be an inclusive and equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability status, gender identity or Veteran status.
We are committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities. If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us. Please see the United States Department of Labor's Know Your Rights poster for additional information.
We comply with the United States Department of Labor's Pay Transparency provision.
PLEASE NOTE: We collect, retain and use personal data for our professional business purposes, including notifying you of job opportunities that may be of interest and sharing with our affiliates. We limit the personal data we collect to that which we believe is appropriate and necessary to manage applicants’ needs, provide our services, and comply with applicable laws. Any information we collect in connection with your application will be treated in accordance with our internal policies and programs designed to protect personal data. Please see our privacy policy for additional information.


