Software Engineer, GPU Infrastructure (HPC)

Cohere

Staff Software Engineer

Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate, and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their cutting-edge AI workload needs, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating superclusters across multiple clouds. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform, North.

Please Note: All of our infrastructure roles require participating in a 24x7 on-call rotation, where you are compensated for your on-call schedule.

As a Staff Software Engineer, you will:

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
  • Mentor and collaborate: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

You may be a good fit if you have:

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

If some of the above doesn't line up perfectly with your experience, we still encourage you to apply!

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees at Cohere enjoy these perks:

  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Vacancy posted 3 days ago