Hyperbolic Labs - Senior Site Reliability Engineer

Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open-Access AI Cloud. By aggregating computing resources across the globe, we offer an innovative GPU marketplace and AI inference service that promise affordability and accessibility for all. As pioneers at the intersection of AI and open-source technology, we believe in an open future where AI innovation is limited only by imagination, not by access to resources. We're looking for forward-thinking individuals who share our passion for making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators everywhere to turn their visionary AI projects into reality. As we prepare for growth after our Series A, our team, led by co-founders with PhDs in AI, Math, and Computer Science, is poised to redefine computing.

About the Role

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. As an aggregator of compute resources from hundreds of global suppliers, our SLOs, trust, and economic efficiency are product-critical. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.

In this role, you'll establish the reliability standards that define customer trust in our platform, design monitoring and alerting systems that provide deep visibility into our infrastructure, build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience. You'll also focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.

Qualifications

Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
Excellent problem-solving skills with ability to debug complex distributed systems issues under pressure
Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines

Preferred Qualifications

Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
Knowledge of multi-tenancy security patterns, container security, and runtime security tools
Experience with chaos engineering, fault injection, and resilience testing
Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
Contributions to open-source reliability, observability, or security tools

Vacancy posted 1 day ago

