Senior GPU HPC Platform Reliability Engineer
OpenAI
A leading AI research company in San Francisco is seeking a software engineer for its Fleet High Performance Computing team. In this role, you'll ensure the reliability and uptime of the compute fleet, working with automation systems and monitoring tools. Ideal candidates have experience managing server environments and proficiency in languages like Python or Go. Join us to innovate in AI technology while maintaining high system efficiency.
- ...biotechnology firm in South San Francisco is seeking a Site Reliability Engineer to architect and implement Infrastructure as Code (IaC) solutions that enhance cloud-based platform solutions for Machine Learning and HPC workloads. The ideal candidate has extensive experience... Senior, 3 days per week
- A tech company focused on AI is seeking a Site Reliability Engineer to ensure the reliability and performance of its GPU marketplace. This role involves maintaining service level objectives, managing capacity, and implementing secure systems. The ideal candidate has strong... Senior
$250k
...infrastructure provider building a next-generation GPU platform designed for AI training, experimentation,... ...States. The company is looking for a Senior / Staff Site Reliability Engineer to support and scale large-scale HPC and cloud environments powering GPU-... Senior, Permanent employment, Remote work
- Sciforium, an AI infrastructure company in San Francisco, is looking for a Senior HPC & GPU Infrastructure Engineer to manage the health and performance of our GPU compute cluster. You will be the primary custodian of a high-density accelerator environment, bridging hardware... Senior, Flexible hours
$163k - $203k
GoTo Meeting is looking for a Senior Site Reliability Engineer in San Francisco. You will be responsible for the reliability, scalability, and security of Prosper’s Cloud Platform portfolio. This role requires expertise in Kubernetes, cloud platforms (preferably GCP), and... Senior
- An innovative R&D company in San Francisco is seeking a Site Reliability Engineer to join its Platform Engineering team. This position focuses on ensuring the reliability and performance of an AI-powered code review platform. The ideal candidate will have 6-8 years of experience... Senior
$200k - $250k
A leading visual creation platform in San Francisco is seeking a Senior Owner of Stability and Infrastructure. This hands-on technical leadership role demands expertise in service reliability to ensure the platform's performance as it scales. Responsibilities include setting... Senior
- OpenArt AI in San Francisco is seeking a Senior Platform & Reliability Engineer to design and improve the reliability of its infrastructure. The role emphasizes building and operating production systems while collaborating with product engineers to ensure platform scalability... Senior
- ...identity security, delivering an AI-powered platform that governs and secures access to... ...cloud-native systems. As a Staff Platform Engineer, you will play a critical role in ensuring... ...technical leadership role. You will own reliability for major platform domains, design... Senior
- AngelList Venture in San Francisco is seeking a Senior Infrastructure Engineer to build and optimize platform infrastructure that supports billions in venture assets... ...developer productivity through automation and reliability practices. The ideal candidate has a solid... Senior, Work at office
- ...Senior HPC & GPU Infrastructure Engineer Sciforium is an AI infrastructure company developing next-generation... ..., high-efficiency serving platform. Backed by multi-million-dollar funding... ...take full ownership of the health, reliability, and performance of our GPU... Senior, Flexible hours
$200k - $250k
...unsolicited. About Vizcom Vizcom is a visual creation platform that combines modern web tooling with AI-... ...production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale. Role... Senior, Permanent employment
- ...superintelligence. One person, one GPU. If you'd like to... ...is currently Tuesday. Engineering at Lambda is... ...operate observability platforms for logging, metrics,... ...monitoring for modern AI/HPC cluster infrastructure... ...adoptable and improve product reliability. Lead members of other... Senior, Work at office, Local area, Work from home
- Overview Senior Platform & Reliability Engineer OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters... Senior, Remote work, Worldwide, Visa sponsorship
- A technology infrastructure company in San Francisco is seeking an experienced engineer to manage and operate GPU clusters. The role requires over 5 years of hands-on experience, a deep understanding of hardware systems, and a passion for automating fleet operations. You... Senior
- Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco... ...it’s needed most. Our platform routes training and inference... ..., and debug large‑scale GPU infrastructure used for distributed... ...with Slurm or other HPC schedulers is equally valued... Senior, Full time, Remote work
- Hamilton Barnes Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco. This... ...designing scalable storage solutions for high-performance GPU platforms. The ideal candidate has extensive experience in storage engineering... Senior, Remote job
$250k
Hamilton Barnes Associates Limited in San Francisco is seeking an experienced engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in Kubernetes and Linux systems... Senior
- Hamilton Barnes Associates Limited is seeking a Senior / Staff Site Reliability Engineer in San Francisco, California. This role focuses on supporting and scaling HPC and cloud environments, improving automation and reliability across distributed systems. The ideal candidate... Senior, Remote job
$156.86k - $191.72k
...System Infrastructure / Platform Engineer The National Energy Research Scientific... ...to help build and manage HPC systems and Linux-based... ...edge technologies such as CPU/GPU clusters, parallel storage, high... ..., balancing innovation with reliability, performance, and security at... Full time, Remote work, Flexible hours
$232k - $319k
...too, let's talk. The Infrastructure Platform and Shared Services Team Okta authenticates... ...scale the service with great people and reliable, cost-effective, and efficient... ...Accelerate the velocity of SRE and product engineering by developing robust platforms, powerful... Senior, Permanent employment, Local area, Worldwide, Flexible hours
- ...across the globe, we offer an innovative GPU marketplace and AI inference service... ...and affordable. Join us in building a platform that empowers innovators everywhere to... .... About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace... Senior
$157.7k - $277.8k
...Full time Location Type Hybrid Department Engineering, product & design Compensation SF & NYC... .... With WRITER's end-to-end platform, hundreds of companies like Mars, Marriott... ...platform must be available, performant, and reliable, 24/7. As an Infrastructure engineer, you... Senior, Full time, Work at office, Local area, Flexible hours
$202.8k - $327.63k
...Intelligent Agreement Management platform, companies can create, commit, and... ...management (CLM). What you’ll do The Senior Director, SRE Platform Engineering is a senior engineering leader... ...Service Management (ITSM) and Site Reliability Engineering (SRE) capabilities, applying... Senior, Permanent employment, Contract work, Work at office, Local area, Remote work, 2 days per week
$300k
...building out their AI and cloud platform, powered by thousands of H1... ...inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the... ...performance, and automation of this GPU-powered infrastructure,... ...-performance computing (HPC) or AI/ML training... Senior, Permanent employment
- MixedBread AI in San Francisco is seeking a DevOps Engineer to join their core infrastructure team. You will be responsible for building... .... Ideal candidates have strong experience with cloud platforms and Infrastructure-as-Code tools and background in monitoring... Senior
- MakerMaker, based in San Francisco, is seeking a highly skilled kernel engineer to write and optimize GPU kernels that enhance performance for training and inference. This role involves deep, low-level work to close the significant performance gap that exists in modern... Senior
$120k - $196k
The CoreHPC team at the University of California, San Francisco is searching for an HPC Systems Engineer to enhance operations and maintenance of the HPC clusters. This role includes applying advanced solutions to resolve user issues and support researchers with computational... Senior
- The CoreHPC team at UCSF Health is looking for an HPC Systems Engineer to enhance and maintain the Institute’s HPC clusters. The role involves defining and implementing complex cyber-infrastructure and providing support to researchers. Ideal candidates will have a Bachelor... Senior
- Senior Infrastructure Engineer - Bland As a Senior Infrastructure Engineer at Bland... ...processing with strict latency and reliability requirements; building and... ...solving them; ensuring platform reliability through... ...for AI/ML workloads with GPU support, implementing container... Senior, Temporary work


