AI Infra & Cluster Engineer — Scale GPU/CPU Orchestration

Linuxcareers

Linuxcareers is seeking an Infrastructure/Cluster Engineer to design and operate large-scale clusters that enable AI inference at scale. The role focuses on managing diverse hardware architectures and building robust infrastructure. The ideal candidate will possess deep expertise in Linux systems, automation tools, and orchestration technologies. Responsibilities include debugging performance issues and designing observability systems for cluster health. Experience with GPU infrastructure is a plus.

Vacancy posted 2 days ago

Similar jobs that could be interesting for you
Based on the AI Infra & Cluster Engineer — Scale GPU/CPU Orchestration in San Francisco, CA vacancy

GPU Infra Engineer: Scale Massive Clusters & Observability
A cutting-edge tech company in San Francisco seeks infrastructure engineers to enhance the tooling and systems that power its AI applications. Responsibilities include building GPU orchestration, scaling cloud batch job systems, and designing efficient scheduling software...
Suggested
Visa sponsorship
Exa Corporation
San Francisco, CA
1 day ago
Founding Engineer: Scale AI Infra & Orchestration Equity
$300k
Albert Bow is seeking a Founding Engineer to design and scale their distributed systems for autonomous AI agents. With a salary of up to $300,000 and equity, you will have the opportunity to join an experienced founding team at a rapidly growing venture-backed AI startup...
Suggested
Albert Bow
San Francisco, CA
4 days ago
Senior HPC & GPU Infra Engineer — Build Frontier AI
Sciforium is looking for a Senior HPC & GPU Infrastructure Engineer to oversee our GPU compute cluster’s health, reliability, and performance. This role involves hands-on Linux systems engineering, GPU driver management, and maintaining machine learning software stacks...
Suggested
Flexible hours
Sciforium
San Francisco, CA
1 day ago
Senior ML Training Systems Engineer - Distributed GPU Infra
A leading AI technology company in San Francisco is looking for a Senior Software Engineer to build scalable infrastructure for large‑scale training and fine-tuning of foundation models. You will design... ...training systems and optimize GPU utilization while collaborating with...
Suggested
Baseten
San Francisco, CA
1 day ago
GPU Kernel Engineer: Build Fast AI Inference at Scale
...A leading AI acceleration company in San Francisco is seeking a GPU Kernel Engineer to optimize performance for machine learning models. You will be responsible for designing high-performance GPU kernels and using advanced techniques to boost computation efficiency. Ideal...
Suggested
Baseten
San Francisco, CA
1 day ago
Multimodal Inference Engineer — Scale GPU AI Models
...innovative company is seeking a talented software engineer to join their dynamic Inference team. This... ...and implementing infrastructure for large-scale multimodal models, focusing on high-... ...product teams to push the boundaries of AI technology, ensuring reliable production services...
OpenAI
San Francisco, CA
17 hours ago
Senior Infra Engineer: Scale Core Platform for AI
Nooks in San Francisco is seeking a Senior Engineer to build infrastructure that enhances the efficiency of multiple product teams. The... ...engineering experience, particularly in distributed systems and scaling production environments. Candidates should be comfortable...
Work at office
3 days per week
Nooks
San Francisco, CA
1 day ago
Infra Engineer: Kubernetes, AI Scale, Equity
A high-growth AI startup in San Francisco is seeking a Software Engineer (Infrastructure) to design and scale Kubernetes systems for a rapidly expanding platform. You will be responsible for leading technical deployments for enterprise clients and developing secure execution...
Jack & Jill/External ATS
San Francisco, CA
3 days ago
Senior Site Reliability Engineer (GPU Clusters) - Hosting
$250k
...opportunities? Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform designed for AI... ...Senior / Staff Site Reliability Engineer to support and scale large-scale... ...monitoring frameworks for GPU compute clusters Collaborate with ML, data,...
Permanent employment
Remote work
San Francisco, CA
27 days ago
AI Infra Systems Engineer Intern — GPU & Cloud Scaling
$190k - $270k
AI Chopping Block, Inc. is seeking an experienced AI Infrastructure Engineer to manage user-facing services and production systems. The role encompasses participating in on-call rotations, building infrastructure with tools like Ansible, Terraform, and Kubernetes, and...
Full time
Internship
AI Chopping Block, Inc.
San Francisco, CA
4 days ago
Senior GPU Infra Engineer — AI Fleet Automation
$180k - $250k
A tech innovation company is looking for a hands-on engineer in San Francisco to manage a vast fleet of GPU servers. You will build systems for tracking server lifecycle, automate provisioning and health checks, and ensure OS-level security. The role requires 5+ years of...
Fal
San Francisco, CA
1 day ago
Software Engineer, Distributed Systems
$180k - $250k
...next generation of AI products. We build... ...production, and do it at scale without compromise.... ...inference, orchestration, and observability... ...experienced software engineer who thrives on building... ..., scheduling, GPU autoscaling, large... ...and tune low level CPU and memory performance...
Currently hiring
Remote work
Relocation package
Fal
San Francisco, CA
1 day ago
Site Reliability Engineer (SRE)
$170k - $230k
...Site Reliability Engineer (SRE) Palo Alto... ...Mithril is an AI infrastructure platform... ...platform built to make GPU compute more... ...shape how Mithril scales its platform across... ...Mithril's global GPU orchestration platform. This is... ...managing clusters, deployments, and...
Work at office
Local area
1 day per week
Mithril
San Francisco, CA
3 days ago
Go Engineer - Neki Orchestration & Scale
$120k - $290k
Somi AI is looking for a Software Engineer to join their team in San Francisco. In this role, you will design and build systems that provision and scale Neki clusters, ensuring high availability and data protection. The ideal candidate will have 5+ years of software engineering...
Somi AI
San Francisco, CA
4 days ago
Hyperscale Cluster Infra Engineer - Kubernetes & Bare-Metal
A leading AI research company in San Francisco is seeking engineers to operate next-gen compute clusters. The role requires scaling Kubernetes, automating infrastructure, and ensuring system reliability. Ideal candidates have strong Kubernetes and scripting skills with...
Slope
San Francisco, CA
4 days ago
Senior GPU Compute Infra Engineer (Remote US)
$200k - $400k
Inferact is seeking a dedicated cluster administration engineer to manage high-performance GPU compute infrastructure in San Francisco. This hands-on role focuses on optimizing system health and availability for engineering productivity. Ideal candidates will have substantial...
Remote job
Inferact
San Francisco, CA
7 hours ago
RL Infra Engineer: Scale GPU RL Experiments (Equity)
$300k
Aionia Group in San Francisco is seeking a Systems Infrastructure Engineer to build scalable infrastructure for RL experiments. This role... ...on innovative projects with leading researchers in a well-funded AI company. The ideal candidate has over 2 years of experience in...
Aionia Group
San Francisco, CA
4 days ago
Software Engineer — GPU Networking & Distributed Systems
...the world's most dynamic AI companies, like Cursor,... ...build the platform engineers turn to to ship AI products... ...multi‑modal workloads scale, the network is the... ...engineers to lead our GPU Networking efforts, making... ...performance on bleeding‑edge clusters (H100/H200, B200/B300,...
Flexible hours
Baseten
San Francisco, CA
1 day ago
Senior Site Reliability Engineer (SRE) - AI Infrastructure
$300k
...building out their AI and cloud platform... ..., full-scale model training, or... ...inference. As a Platform Engineer/Senior Site Reliability... ...of this GPU-powered infrastructure... ...ensuring seamless orchestration across environments... ...of the largest GPU clusters in private deployment...
Hamilton Barnes Associates Limited
San Francisco, CA
1 day ago
Senior Site Reliability Engineer AI Infrastructure
Senior Site Reliability Engineer - AI Infrastructure... ...About Andromeda Andromeda Cluster was founded by Nat Friedman... ...access to the kind of scaled AI infrastructure once... ...systems, network, and orchestration layer that makes the... ...and debug large‑scale GPU infrastructure used...
Full time
Remote work
Cortes 23
San Francisco, CA
4 days ago
Site Reliability Engineer - AI Infrastructure
Site Reliability Engineer - AI Infrastructure Location: Global... ...Andromeda Andromeda Cluster was founded by Nat... ...access to the kind of scaled AI infrastructure once... ...systems, network, and orchestration layer that makes the world... .../AI infrastructure or GPU-based systems (CUDA,...
Full time
Remote work
Andromeda Cluster
San Francisco, CA
1 day ago
Senior Staff Data Center Operations Engineer, GPU Hardware Architecture
$179k - $218k
...the only vertically integrated AI infrastructure company built... ...urgency, who believe in the scale of our ambition and thrive on... ...Staff Data Center Operations Engineer, GPU Hardware Architecture to be the... ...needed to maintain peak cluster health. The Strategic Bridge...
Temporary work
Crusoe
San Francisco, CA
3 days ago
HPC Systems Engineer: Scale Linux Clusters & Automation
Mistral in San Francisco is seeking a Systems Engineer/System Administrator to manage and scale its AI infrastructure. This hybrid role demands skills in Linux... ...in systems administration and experience with HPC clusters or cloud infrastructure. Join Mistral for a high-impact...
Mistral
San Francisco, CA
3 days ago
Senior Systems Performance Engineer - AI Infra & Scale
$172.5k - $210k
A cutting-edge AI infrastructure firm located in San Francisco is seeking a Senior Systems Performance Engineer. This role involves leading hardware evaluations and optimizing AI systems for performance. Candidates should have over 5 years of experience, proficiency in...
Epoch Biodesign
San Francisco, CA
3 days ago
Senior Site Reliability Engineer - AI Cloud & GPU Infra
A tech company focused on AI is seeking a Site Reliability Engineer to ensure the reliability and performance of its GPU marketplace. This role involves maintaining service level objectives, managing capacity, and implementing secure systems. The ideal candidate has strong...
Hyperbolic Labs
San Francisco, CA
1 day ago
Senior AI/ML Infra & SRE Engineer
Senior Infrastructure Engineer - Bland As a Senior Infrastructure... ...anticipating and solving scaling challenges related to... ...industries. Lead - AI/ML Stack Infrastructure Lead... ...operating production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing...
Temporary work
AI Chopping Block, Inc.
San Francisco, CA
1 day ago
Sales Engineer - AI infrastructure
$300k
...Join a seed-stage AI infrastructure company building large-scale training and inference platforms... ...with a single managed GPU cluster that reached capacity... ..., networking, and orchestration. You lead technical... ...with both executives and engineers, and help create a repeatable...
Permanent employment
Immediate start
San Francisco, CA
more than 2 months ago
ML Systems Engineer — Scale AI for Science (Remote)
$250k - $400k
A leading AI research firm in San Francisco seeks experienced professionals to build and scale systems for AI-driven scientific discovery. The role involves developing... ...base plus equity, with opportunities for ML Engineers, ML Infra, Research Engineers, and Research...
Remote job
Trades Workforce Solutions
San Francisco, CA
1 day ago
Systems Engineer - Network & Storage Infra (Hybrid)
$335k
OpenAI in San Francisco seeks a System Engineer to architect and operationalize essential infrastructure for AI systems. The role demands 7+ years in systems engineering... ...experience debugging and a solid grasp of clustering and scaling in production environments. Offers a hybrid...
Relocation package
OpenAI
San Francisco, CA
1 day ago
Software Engineer - Systems
...history. When people finance GPU clusters, the datacenters housing... ...to the market? Otherwise, as AI scales, compute only becomes available... ...metal servers with our VM orchestration software all the way to coordinating... ...assembly Understanding of CPU interrupts Networking...
Long term contract
Contract work
Fixed term contract
Work at office
Local area
Visa sponsorship
Shift work
The San Francisco Compute Company
San Francisco, CA
1 day ago
