Distributed Training Engineer, Sora
OpenAI
Distributed Systems/ML Engineer
The Sora team is working on making video a key capability of OpenAI's foundation models. We are a hybrid research and product team that seeks to understand and expand the capabilities of our video models, while ensuring their reliability and safety. We accomplish this both by directly studying and experimenting with the models and by deploying them into the real world to distribute their benefits widely.
As a Distributed Systems/ML Engineer, you will work on improving the training throughput of our internal training framework and enable researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. We're looking for people who love optimizing performance and understanding distributed systems, and who cannot stand having bugs in their code.
This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.
In this role, you will:
- Collaborate with researchers to enable them to develop systems-efficient video models and architectures
- Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs
- Profile and optimize our training framework
You might thrive in this role if you:
- Have experience working with multi-modal ML pipelines
- Love diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability
- Have strong software engineering skills and are proficient in Python
- Have experience understanding and optimizing training kernels
- Are passionate about understanding stable training dynamics
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other applicable legally protected characteristic.
Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.
To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.
We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
OpenAI Global Applicant Privacy Policy
At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

