Software Engineer, Distributed Training, AI Infrastructure

$118k - $390k

Tesla

What to Expect

As a Software Engineer within the Autopilot AI Infrastructure team, you will work on reinforcing, optimizing, and scaling the infrastructure components that support AI research for Autopilot and Optimus.

At the core of our autonomy capabilities are neural networks that the research team is designing to train on very large amounts of data, across large-scale GPU clusters. Robustly training these models at scale and in the shortest amount of time is critical to our mission.

We are building and improving the in-house distributed training framework used by the research team to train production models, ensuring good ergonomics and flexibility for experimentation while providing good stability and performance.

What You'll Do
  • Write robust Python software code in our machine learning training repository while applying best software practices to support the research team
  • Increase the reliability of our training jobs by debugging and root causing failures across thousands of nodes and implementing fixes to prevent future failures
  • Improve our training framework to support new training paradigms and experimentation methods
  • Build and improve our monitoring/observability infra to quickly debug cluster and training application issues
  • Profile and identify performance bottlenecks of training software in our training cluster
  • Coordinate with the supercomputing team managing the training cluster to maintain high availability and job throughput

What You'll Bring
  • Members of the Autopilot AI Infrastructure team are expected to be adaptable to the dynamic requirements of AI research and capable of contributing across all parts of the AI training software stack
  • Practical programming experience in Python and/or C/C++
  • Experience working with ML training frameworks (ideally PyTorch)
  • Demonstrated experience scaling neural network training jobs across many GPUs
  • Experience with parallel programming concepts and primitives
  • Experience profiling and optimizing CPU-GPU interactions (pipelining computation with data transfers, etc.)
  • Proficient in system-level software, in particular hardware-software interactions and resource utilization
  • Understanding of state-of-the-art deep learning concepts
  • Experience programming in CUDA/Triton and/or NCCL internals

Compensation and Benefits

Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits at day 1 of hire:

  • Medical plans, including plan options with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans, both of which have options with a $0 paycheck contribution
  • Company-paid Health Savings Account (HSA) contribution when enrolled in the High-Deductible medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
  • Company paid Basic Life, AD&D
  • Short-term and long-term disability insurance (90 day waiting period)
  • Employee Assistance Program
  • Sick and Vacation time (Flex time for salary positions, Accrued hours for Hourly positions), and Paid Holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits to include: critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
  • Weight Loss and Tobacco Cessation Programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program

Expected Compensation

$118,000 - $390,000/annual salary + cash and stock awards + benefits

Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.

Vacancy posted 4 days ago