Software Engineer, GPU Infrastructure (HPC)
Cohere
Staff Software Engineer
Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.
The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their AI workload needs on the cutting edge, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating superclusters across multiple clouds. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform, North.
Please Note: All of our infrastructure roles require participating in a 24x7 on-call rotation, where you are compensated for your on-call schedule.
As a Staff Software Engineer, you will:
- Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high-throughput, low-latency performance for AI workloads.
- Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
- Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
- Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
- Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
- Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
- Mentor and collaborate: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.
You may be a good fit if you have:
- Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
- Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
- Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
- Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
- Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
- Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.
If some of the above doesn't line up perfectly with your experience, we still encourage you to apply!
We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.
Full-Time Employees at Cohere enjoy these perks:
- An open and inclusive culture and work environment
- Work closely with a team on the cutting edge of AI research
- Weekly lunch stipend, in-office lunches & snacks
- Full health and dental benefits, including a separate budget to take care of your mental health
- 100% Parental Leave top-up for up to 6 months
- Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
- Remote-flexible work, with offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
- 6 weeks of vacation (30 working days!)

