Distributed Training Engineer, Sora

OpenAI

Distributed Systems/ML Engineer

The Sora team is working on making video a key capability of OpenAI's foundation models. We are a hybrid research and product team that seeks to understand and expand the capabilities of our video models while ensuring their reliability and safety. We accomplish this both by directly studying and experimenting with the models and by deploying them into the real world to distribute their benefits widely.

As a Distributed Systems/ML Engineer, you will work on improving the training throughput of our internal training framework and enable researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. We're looking for people who love optimizing performance and understanding distributed systems, and who cannot stand having bugs in their code.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

  • Collaborate with researchers to enable them to develop systems-efficient video models and architectures
  • Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs
  • Profile and optimize our training framework

You might thrive in this role if you:

  • Have experience working with multi-modal ML pipelines
  • Love diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability
  • Have strong software engineering skills and are proficient in Python
  • Have experience understanding and optimizing training kernels
  • Are passionate about understanding stable training dynamics

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Vacancy posted 17 hours ago