Senior GPU HPC Platform Reliability Engineer

OpenAI

A leading AI research company in San Francisco is seeking a software engineer for its Fleet High Performance Computing team. In this role, you'll ensure the reliability and uptime of the compute fleet, working with automation systems and monitoring tools. Ideal candidates have experience managing server environments and proficiency in languages like Python or Go. Join us to innovate in AI technology while maintaining high system efficiency.

Vacancy posted 3 days ago

Similar jobs that could be interesting for you

Based on the Senior GPU HPC Platform Reliability Engineer vacancy in San Francisco, CA

Senior Site Reliability Engineer - ML/HPC Cloud Platforms
...biotechnology firm in South San Francisco is seeking a Site Reliability Engineer to architect and implement Infrastructure as Code (IaC) solutions that enhance cloud-based platform solutions for Machine Learning and HPC workloads. The ideal candidate has extensive experience...
Senior
3 days per week
Genentech
South San Francisco, CA
2 days ago
Senior Site Reliability Engineer - AI Cloud & GPU Infra
A tech company focused on AI is seeking a Site Reliability Engineer to ensure the reliability and performance of its GPU marketplace. This role involves maintaining service level objectives, managing capacity, and implementing secure systems. The ideal candidate has strong...
Senior
Hyperbolic Labs
San Francisco, CA
4 days ago
Senior Site Reliability Engineer (GPU Clusters) - Hosting
$250k
...infrastructure provider building a next-generation GPU platform designed for AI training, experimentation,... ...States. The company is looking for a Senior / Staff Site Reliability Engineer to support and scale large-scale HPC and cloud environments powering GPU-...
Senior
Permanent employment
Remote work
San Francisco, CA
19 days ago
Senior GPU/HPC Infra Engineer — High-Perf AI Cluster
Sciforium, an AI infrastructure company in San Francisco, is looking for a Senior HPC & GPU Infrastructure Engineer to manage the health and performance of our GPU compute cluster. You will be the primary custodian of a high-density accelerator environment, bridging hardware...
Senior
Flexible hours
Sciforium
San Francisco, CA
3 days ago
Senior SRE & Platform Engineer for AI-Driven Ops
$163k - $203k
GoTo Meeting is looking for a Senior Site Reliability Engineer in San Francisco. You will be responsible for the reliability, scalability, and security of Prosper’s Cloud Platform portfolio. This role requires expertise in Kubernetes, cloud platforms (preferably GCP), and...
Senior
GoTo Meeting
San Francisco, CA
2 days ago
Senior SRE Platform Engineer for AI-Powered Code Review
An innovative R&D company in San Francisco is seeking a Site Reliability Engineer to join its Platform Engineering team. This position focuses on ensuring the reliability and performance of an AI-powered code review platform. The ideal candidate will have 6-8 years of experience...
Senior
CodeRabbit
San Francisco, CA
5 days ago
Senior Platform Reliability Engineer
$200k - $250k
A leading visual creation platform in San Francisco is seeking a Senior Owner of Stability and Infrastructure. This hands-on technical leadership role demands expertise in service reliability to ensure the platform's performance as it scales. Responsibilities include setting...
Senior
Vizcom
San Francisco, CA
2 days ago
Senior Platform & Reliability Engineer — AI-Native Scale
OpenArt AI in San Francisco is seeking a Senior Platform & Reliability Engineer to design and improve the reliability of its infrastructure. The role emphasizes building and operating production systems while collaborating with product engineers to ensure platform scalability...
Senior
OpenArt AI
San Francisco, CA
1 day ago
Senior / Staff Site Reliability, Platform Engineering
...identity security, delivering an AI-powered platform that governs and secures access to... ...cloud-native systems. As a Staff Platform Engineer, you will play a critical role in ensuring... ...technical leadership role. You will own reliability for major platform domains, design...
Senior
Saviynt
San Francisco, CA
7 days ago
Senior Platform & Reliability Engineer
AngelList Venture in San Francisco is seeking a Senior Infrastructure Engineer to build and optimize platform infrastructure that supports billions in venture assets... ...developer productivity through automation and reliability practices. The ideal candidate has a solid...
Senior
Work at office
AngelList Venture
San Francisco, CA
2 days ago
Senior HPC & GPU Infrastructure Engineer
...Senior HPC & GPU Infrastructure Engineer Sciforium is an AI infrastructure company developing next-generation... ..., high-efficiency serving platform. Backed by multi-million-dollar funding... ...take full ownership of the health, reliability, and performance of our GPU...
Senior
Flexible hours
Sciforium
San Francisco, CA
9 days ago
Senior Platform & Reliability Engineer (SRE)
$200k - $250k
...unsolicited. About Vizcom Vizcom is a visual creation platform that combines modern web tooling with AI-... ...production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale. Role...
Senior
Permanent employment
Vizcom
San Francisco, CA
2 days ago
Senior Site Reliability Engineer - Observability
...superintelligence. One person, one GPU. If you'd like to... ...is currently Tuesday. Engineering at Lambda is... ...operate observability platforms for logging, metrics,... ...monitoring for modern AI/HPC cluster infrastructure... ...adoptable and improve product reliability. Lead members of other...
Senior
Work at office
Local area
Work from home
Lambda
San Francisco, CA
1 day ago
Senior Platform & Reliability Engineer
Overview Senior Platform & Reliability Engineer OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters...
Senior
Remote work
Worldwide
Visa sponsorship
OpenArt AI
San Francisco, CA
1 day ago
Senior HPC GPU Compute Engineer (Hybrid SF)
A technology infrastructure company in San Francisco is seeking an experienced engineer to manage and operate GPU clusters. The role requires over 5 years of hands-on experience, a deep understanding of hardware systems, and a passion for automating fleet operations. You...
Senior
The San Francisco Compute Company
San Francisco, CA
3 days ago
Senior Site Reliability Engineer AI Infrastructure
Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco... ...it’s needed most. Our platform routes training and inference... ..., and debug large‑scale GPU infrastructure used for distributed... ...with Slurm or other HPC schedulers is equally valued...
Senior
Full time
Remote work
Cortes 23
San Francisco, CA
2 days ago
Senior AI Storage Engineer - Remote GPU HPC Infra
Hamilton Barnes Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco. This... ...designing scalable storage solutions for high-performance GPU platforms. The ideal candidate has extensive experience in storage engineering...
Senior
Remote job
Hamilton Barnes Associates Limited
San Francisco, CA
4 days ago
Senior SRE — AI GPU Infra for Large-Scale HPC (IPO Equity)
$250k
Hamilton Barnes Associates Limited in San Francisco is seeking an experienced engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in Kubernetes and Linux systems...
Senior
Hamilton Barnes Associates Limited
San Francisco, CA
5 days ago
Senior GPU HPC SRE — Remote, Stock Options
Hamilton Barnes Associates Limited is seeking a Senior / Staff Site Reliability Engineer in San Francisco, California. This role focuses on supporting and scaling HPC and cloud environments, improving automation and reliability across distributed systems. The ideal candidate...
Senior
Remote job
Hamilton Barnes Associates Limited
San Francisco, CA
4 days ago
System Infrastructure / Platform Engineer, HPC Technology Department
$156.86k - $191.72k
...System Infrastructure / Platform Engineer The National Energy Research Scientific... ...to help build and manage HPC systems and Linux-based... ...edge technologies such as CPU/GPU clusters, parallel storage, high... ..., balancing innovation with reliability, performance, and security at...
Full time
Remote work
Flexible hours
Berkeley Lab
San Francisco, CA
2 days ago
Senior Manager, Site Reliability Engineering - Infrastructure Platform
$232k - $319k
...too, let's talk. The Infrastructure Platform and Shared Services Team Okta authenticates... ...scale the service with great people and reliable, cost-effective, and efficient... ...Accelerate the velocity of SRE and product engineering by developing robust platforms, powerful...
Senior
Permanent employment
Local area
Worldwide
Flexible hours
Okta, Inc.
San Francisco, CA
1 day ago
Hyperbolic Labs - Senior Site Reliability Engineer
...across the globe, we offer an innovative GPU marketplace and AI inference service... ...and affordable. Join us in building a platform that empowers innovators everywhere to... .... About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace...
Senior
deCircle
San Francisco, CA
1 day ago
Senior Infrastructure & Reliability Engineer - AI Platform
$157.7k - $277.8k
...Full time Location Type Hybrid Department Engineering, product & design Compensation SF & NYC... .... With WRITER's end-to-end platform, hundreds of companies like Mars, Marriott... ...platform must be available, performant, and reliable, 24/7. As an Infrastructure engineer, you...
Senior
Full time
Work at office
Local area
Flexible hours
Writer
San Francisco, CA
2 days ago
Sr. Director, SRE Platform Engineering
$202.8k - $327.63k
...Intelligent Agreement Management platform, companies can create, commit, and... ...management (CLM). What you’ll do The Senior Director, SRE Platform Engineering is a senior engineering leader... ...Service Management (ITSM) and Site Reliability Engineering (SRE) capabilities, applying...
Senior
Permanent employment
Contract work
Work at office
Local area
Remote work
2 days per week
DocuSign, Inc.
San Francisco, CA
1 day ago
Senior Site Reliability Engineer (SRE) - AI Infrastructure
$300k
...building out their AI and cloud platform, powered by thousands of H1... ...inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the... ...performance, and automation of this GPU-powered infrastructure,... ...-performance computing (HPC) or AI/ML training...
Senior
Permanent employment
San Francisco, CA
more than 2 months ago
Senior DevOps Engineer: AI Platform & GPU Infra
MixedBread AI in San Francisco is seeking a DevOps Engineer to join their core infrastructure team. You will be responsible for building... .... Ideal candidates have strong experience with cloud platforms and Infrastructure-as-Code tools and background in monitoring...
Senior
MixedBread AI
San Francisco, CA
4 days ago
Senior GPU Kernel Engineer - Accelerate AI Training Systems
MakerMaker, based in San Francisco, is seeking a highly skilled kernel engineer to write and optimize GPU kernels that enhance performance for training and inference. This role involves deep, low-level work to close the significant performance gap that exists in modern...
Senior
MakerMaker
San Francisco, CA
2 days ago
Senior HPC Systems Engineer - Research Compute & Storage
$120k - $196k
The CoreHPC team at the University of California, San Francisco is searching for an HPC Systems Engineer to enhance operations and maintenance of the HPC clusters. This role includes applying advanced solutions to resolve user issues and support researchers with computational...
Senior
University of California, San Francisco
San Francisco, CA
5 days ago
Senior HPC Systems Engineer - Research Compute
The CoreHPC team at UCSF Health is looking for an HPC Systems Engineer to enhance and maintain the Institute’s HPC clusters. The role involves defining and implementing complex cyber-infrastructure and providing support to researchers. Ideal candidates will have a Bachelor...
Senior
UCSF Health
San Francisco, CA
2 days ago
Senior AI/ML Infra & SRE Engineer
Senior Infrastructure Engineer - Bland As a Senior Infrastructure Engineer at Bland... ...processing with strict latency and reliability requirements; building and... ...solving them; ensuring platform reliability through... ...for AI/ML workloads with GPU support, implementing container...
Senior
Temporary work
AI Chopping Block, Inc.
San Francisco, CA
4 days ago
