Software Engineer, GPU Infrastructure (HPC)

Cohere

Staff Software Engineer

Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their AI workload needs on the cutting edge, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating superclusters across multiple clouds. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform, North.

Please note: All of our infrastructure roles require participation in a 24x7 on-call rotation, and you are compensated for the time you are on call.

As a Staff Software Engineer, you will:

Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high-throughput, low-latency performance for AI workloads.
Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
Mentor and collaborate: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

You may be a good fit if you have:

Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

If some of the above doesn't line up perfectly with your experience, we still encourage you to apply!

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees at Cohere enjoy these perks:

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible work, with offices in Toronto, New York, San Francisco, London and Paris, plus a co-working stipend
6 weeks of vacation (30 working days!)


Vacancy posted 3 days ago

