Software Engineer - Training Infrastructure

Baseten

Software Engineer

Baseten powers mission-critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma and Writer. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $300M Series E, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Join us and help build the platform that engineers turn to when shipping AI products.

The Role

As a Software Engineer on the Training Infrastructure team, you'll architect and lead development of our training platform, supporting top-tier research engineers and model developers. You'll make key technical decisions for the infrastructure that enables developers to deploy, scale, and monitor their workloads with high performance and reliability. You'll own scheduling, storage, networking, reliability, and observability for the systems in the training stack.

Example Initiatives

Take a look at what we've built so far:

  • Overview of the product so far
  • Training docs overview
  • Story of the Training product
  • Research we've done
Responsibilities
  • Design and architect scalable infrastructure systems for our ML training platform (e.g. scheduling, storage, and networking)
  • Partner closely with developers and research engineers to translate complex training requirements into technical solutions
  • Design and architect a global training scheduler
  • Design and architect reinforcement learning systems and continuous learning pipelines
  • Drive long-term improvements in system reliability and development velocity
  • Partner closely with SRE and Capacity teams to unlock state-of-the-art training infrastructure
  • Make critical architectural decisions balancing performance with system reliability
  • Lead technical discussions and mentor junior engineers on infrastructure best practices
  • Contribute to long-term technical strategy and infrastructure roadmap
Requirements
  • Bachelor's degree or higher in Computer Science or related field
  • Proficiency in Go, with Python experience a plus
  • Deep expertise with Kubernetes in production environments
  • Extensive experience with major cloud providers (AWS, GCP), with neo-cloud provider experience (Crusoe, DigitalOcean, Nebius) a plus
  • Advanced understanding of distributed systems concepts and performance tuning
  • Proven experience designing observability systems
  • Experience with ML/AI workloads and MLOps platforms highly valued
Nice To Have
  • Experience with distributed storage systems
  • Experience with workload orchestration platforms like Temporal or Airflow
  • Familiarity or experience with the open source training stack and frameworks (NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer) and distributed training techniques (FSDP, DeepSpeed)
  • Experience developing AI products, tooling, or agents
Benefits
  • Competitive compensation, including meaningful equity
  • 100% coverage of medical, dental, and vision insurance for employees and dependents
  • Flexible PTO policy including company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
  • Paid parental leave
  • Fertility and family-building stipend through Carrot
  • Company-facilitated 401(k)
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.

We are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (for example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).

Vacancy posted 4 days ago