Senior GPU HPC Platform Reliability Engineer

OpenAI

A leading AI research company in San Francisco is seeking a software engineer for its Fleet High Performance Computing team. In this role, you'll ensure the reliability and uptime of the compute fleet, working with automation systems and monitoring tools. Ideal candidates have experience managing server environments and proficiency in languages like Python or Go. Join us to innovate in AI technology while maintaining high system efficiency.

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Senior GPU HPC Platform Reliability Engineer in San Francisco, CA vacancy
  •  ...biotechnology firm in South San Francisco is seeking a Site Reliability Engineer to architect and implement Infrastructure as Code (IaC) solutions that enhance cloud-based platform solutions for Machine Learning and HPC workloads. The ideal candidate has extensive experience... 
    Senior
    3 days per week

    Genentech

    South San Francisco, CA
    2 days ago
  • A tech company focused on AI is seeking a Site Reliability Engineer to ensure the reliability and performance of its GPU marketplace. This role involves maintaining service level objectives, managing capacity, and implementing secure systems. The ideal candidate has strong... 
    Senior

    Hyperbolic Labs

    San Francisco, CA
    4 days ago
  • $250k

     ...infrastructure provider building a next-generation GPU platform designed for AI training, experimentation,...  ...States. The company is looking for a Senior / Staff Site Reliability Engineer to support and scale large-scale HPC and cloud environments powering GPU-... 
    Senior
    Permanent employment
    Remote work
    San Francisco, CA
    19 days ago
  • Sciforium, an AI infrastructure company in San Francisco, is looking for a Senior HPC & GPU Infrastructure Engineer to manage the health and performance of our GPU compute cluster. You will be the primary custodian of a high-density accelerator environment, bridging hardware... 
    Senior
    Flexible hours

    Sciforium

    San Francisco, CA
    3 days ago
  • $163k - $203k

    GoTo Meeting is looking for a Senior Site Reliability Engineer in San Francisco. You will be responsible for the reliability, scalability, and security of Prosper’s Cloud Platform portfolio. This role requires expertise in Kubernetes, cloud platforms (preferably GCP), and... 
    Senior

    GoTo Meeting

    San Francisco, CA
    2 days ago
  • An innovative R&D company in San Francisco is seeking a Site Reliability Engineer to join its Platform Engineering team. This position focuses on ensuring the reliability and performance of an AI-powered code review platform. The ideal candidate will have 6-8 years of experience... 
    Senior

    CodeRabbit

    San Francisco, CA
    5 days ago
  • $200k - $250k

    A leading visual creation platform in San Francisco is seeking a Senior Owner of Stability and Infrastructure. This hands-on technical leadership role demands expertise in service reliability to ensure the platform's performance as it scales. Responsibilities include setting... 
    Senior

    Vizcom

    San Francisco, CA
    2 days ago
  • OpenArt AI in San Francisco is seeking a Senior Platform & Reliability Engineer to design and improve the reliability of its infrastructure. The role emphasizes building and operating production systems while collaborating with product engineers to ensure platform scalability... 
    Senior

    OpenArt AI

    San Francisco, CA
    1 day ago
  •  ...identity security, delivering an AI-powered platform that governs and secures access to...  ...cloud-native systems. As a Staff Platform Engineer, you will play a critical role in ensuring...  ...technical leadership role. You will own reliability for major platform domains, design... 
    Senior

    Saviynt

    San Francisco, CA
    7 days ago
  • AngelList Venture in San Francisco is seeking a Senior Infrastructure Engineer to build and optimize platform infrastructure that supports billions in venture assets...  ...developer productivity through automation and reliability practices. The ideal candidate has a solid... 
    Senior
    Work at office

    AngelList Venture

    San Francisco, CA
    2 days ago
  •  ...Senior HPC & GPU Infrastructure Engineer Sciforium is an AI infrastructure company developing next-generation...  ..., high-efficiency serving platform. Backed by multi-million-dollar funding...  ...take full ownership of the health, reliability, and performance of our GPU... 
    Senior
    Flexible hours

    Sciforium

    San Francisco, CA
    9 days ago
  • $200k - $250k

     ...unsolicited. About Vizcom Vizcom is a visual creation platform that combines modern web tooling with AI-...  ...production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale. Role... 
    Senior
    Permanent employment

    Vizcom

    San Francisco, CA
    2 days ago
  •  ...superintelligence. One person, one GPU. If you'd like to...  ...is currently Tuesday. Engineering at Lambda is...  ...operate observability platforms for logging, metrics,...  ...monitoring for modern AI/HPC cluster infrastructure...  ...adoptable and improve product reliability. Lead members of other... 
    Senior
    Work at office
    Local area
    Work from home

    Lambda

    San Francisco, CA
    1 day ago
  • Overview Senior Platform & Reliability Engineer OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters... 
    Senior
    Remote work
    Worldwide
    Visa sponsorship

    OpenArt AI

    San Francisco, CA
    1 day ago
  • A technology infrastructure company in San Francisco is seeking an experienced engineer to manage and operate GPU clusters. The role requires over 5 years of hands-on experience, a deep understanding of hardware systems, and a passion for automating fleet operations. You... 
    Senior

    The San Francisco Compute Company

    San Francisco, CA
    3 days ago
  • Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco...  ...it’s needed most. Our platform routes training and inference...  ..., and debug large‑scale GPU infrastructure used for distributed...  ...with Slurm or other HPC schedulers is equally valued... 
    Senior
    Full time
    Remote work

    Cortes 23

    San Francisco, CA
    2 days ago
  • Hamilton Barnes Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco. This...  ...designing scalable storage solutions for high-performance GPU platforms. The ideal candidate has extensive experience in storage engineering... 
    Senior
    Remote job

    Hamilton Barnes Associates Limited

    San Francisco, CA
    4 days ago
  • $250k

    Hamilton Barnes Associates Limited in San Francisco is seeking an experienced engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in Kubernetes and Linux systems... 
    Senior

    Hamilton Barnes Associates Limited

    San Francisco, CA
    5 days ago
  • Hamilton Barnes Associates Limited is seeking a Senior / Staff Site Reliability Engineer in San Francisco, California. This role focuses on supporting and scaling HPC and cloud environments, improving automation and reliability across distributed systems. The ideal candidate... 
    Senior
    Remote job

    Hamilton Barnes Associates Limited

    San Francisco, CA
    4 days ago
  • $156.86k - $191.72k

     ...System Infrastructure / Platform Engineer The National Energy Research Scientific...  ...to help build and manage HPC systems and Linux-based...  ...edge technologies such as CPU/GPU clusters, parallel storage, high...  ..., balancing innovation with reliability, performance, and security at... 
    Full time
    Remote work
    Flexible hours

    Berkeley Lab

    San Francisco, CA
    2 days ago
  • $232k - $319k

     ...too, let's talk. The Infrastructure Platform and Shared Services Team Okta authenticates...  ...scale the service with great people and reliable, cost-effective, and efficient...  ...Accelerate the velocity of SRE and product engineering by developing robust platforms, powerful... 
    Senior
    Permanent employment
    Local area
    Worldwide
    Flexible hours

    Okta, Inc.

    San Francisco, CA
    1 day ago
  •  ...across the globe, we offer an innovative GPU marketplace and AI inference service...  ...and affordable. Join us in building a platform that empowers innovators everywhere to...  .... About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace... 
    Senior

    deCircle

    San Francisco, CA
    1 day ago
  • $157.7k - $277.8k

     ...Full time Location Type Hybrid Department Engineering, product & design Compensation SF & NYC...  .... With WRITER's end-to-end platform, hundreds of companies like Mars, Marriott...  ...platform must be available, performant, and reliable, 24/7. As an Infrastructure engineer, you... 
    Senior
    Full time
    Work at office
    Local area
    Flexible hours

    Writer

    San Francisco, CA
    2 days ago
  • $202.8k - $327.63k

     ...Intelligent Agreement Management platform, companies can create, commit, and...  ...management (CLM). What you’ll do The Senior Director, SRE Platform Engineering is a senior engineering leader...  ...Service Management (ITSM) and Site Reliability Engineering (SRE) capabilities, applying... 
    Senior
    Permanent employment
    Contract work
    Work at office
    Local area
    Remote work
    2 days per week

    DocuSign, Inc.

    San Francisco, CA
    1 day ago
  • $300k

     ...building out their AI and cloud platform, powered by thousands of H1...  ...inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the...  ...performance, and automation of this GPU-powered infrastructure,...  ...-performance computing (HPC) or AI/ML training... 
    Senior
    Permanent employment
    San Francisco, CA
    more than 2 months ago
  • MixedBread AI in San Francisco is seeking a DevOps Engineer to join their core infrastructure team. You will be responsible for building...  .... Ideal candidates have strong experience with cloud platforms and Infrastructure-as-Code tools and background in monitoring... 
    Senior

    MixedBread AI

    San Francisco, CA
    4 days ago
  • MakerMaker, based in San Francisco, is seeking a highly skilled kernel engineer to write and optimize GPU kernels that enhance performance for training and inference. This role involves deep, low-level work to close the significant performance gap that exists in modern... 
    Senior

    MakerMaker

    San Francisco, CA
    2 days ago
  • $120k - $196k

    The CoreHPC team at the University of California, San Francisco is searching for an HPC Systems Engineer to enhance operations and maintenance of the HPC clusters. This role includes applying advanced solutions to resolve user issues and support researchers with computational... 
    Senior

    University of California, San Francisco

    San Francisco, CA
    5 days ago
  • The CoreHPC team at UCSF Health is looking for an HPC Systems Engineer to enhance and maintain the Institute’s HPC clusters. The role involves defining and implementing complex cyber-infrastructure and providing support to researchers. Ideal candidates will have a Bachelor... 
    Senior

    UCSF Health

    San Francisco, CA
    2 days ago
  • Senior Infrastructure Engineer - Bland As a Senior Infrastructure Engineer at Bland...  ...processing with strict latency and reliability requirements; building and...  ...solving them; ensuring platform reliability through...  ...for AI/ML workloads with GPU support, implementing container... 
    Senior
    Temporary work

    AI Chopping Block, Inc.

    San Francisco, CA
    4 days ago
