
Software Engineer I - AI/ML, AWS Neuron Distributed Training

$127.1k - $185k

Amazon

Annapurna Labs designs silicon and software that accelerates innovation. Our custom chips, accelerators, and software stacks enable us to tackle unprecedented technical challenges and deliver solutions that help customers change the world. AWS Neuron is the complete software stack powering AWS Trainium (Trn2/Trn3), our cloud-scale Machine Learning accelerators, and we are seeking a Software Engineer to join our ML Distributed Training team.

In this role, you will be responsible for the development, enablement, and performance optimization of large scale ML model training across diverse model families. This includes massive scale pre-training and post-training of LLMs with Dense and Mixture-of-Experts architectures, Multimodal models that are transformer and diffusion based, and Reinforcement Learning workloads. You will work at the intersection of ML research and high performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers and AWS solution architects to deliver cost-effective, performant machine learning solutions on AWS Trainium based systems.

Key job responsibilities

You will contribute to the design and implementation of distributed training solutions for large-scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks including FSDP, torchtitan, and Hugging Face libraries for the Neuron ecosystem.

A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques. You will work with BF16, FP8, and emerging numerical formats to improve training throughput while maintaining model accuracy and convergence quality. This includes implementing precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced precision formats.

Beyond precision optimization, you will profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. You will partner with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities. Additionally, you will collaborate with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.

About the team

Annapurna Labs was a startup company acquired by AWS in 2015, and is now fully integrated. If AWS is an infrastructure company, then think of Annapurna Labs as the infrastructure provider of AWS. Our org covers multiple disciplines including silicon engineering, hardware design and verification, software, and operations. AWS Nitro, ENA, EFA, Graviton and F1 EC2 Instances, AWS Neuron, Inferentia and Trainium ML Accelerators, and scalable NVMe storage are some of the products we have delivered over the last few years.

BASIC QUALIFICATIONS

- Bachelor's degree or above in computer science, computer engineering, or a related field

- 1+ years of programming experience with at least one software programming language (including academic projects, internships, or research)

- Experience with software development practices including code reviews, source control, testing, and build processes

- Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow)

PREFERRED QUALIFICATIONS

- Master's degree or above in computer science or equivalent

- Experience with large-scale distributed training or LLM workloads

- Experience with computer architecture or hardware-software co-optimization

- Experience with distributed systems, libraries, or frameworks

- Familiarity with end-to-end model training pipelines

- Previous internship or research experience in ML infrastructure or systems software

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company's reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at

USA, CA, Cupertino - 127,100.00 - 185,000.00 USD annually
Vacancy posted 12 hours ago