Hyperbolic Labs - Senior Site Reliability Engineer

deCircle

Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open-Access AI Cloud. By aggregating computing resources across the globe, we offer an innovative GPU marketplace and AI inference service that promise affordability and accessibility for all. As pioneers at the intersection of AI and open-source technology, we believe in an open future where AI innovation is limited only by imagination, not by access to resources.

We're looking for forward-thinking individuals who share our passion for making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators everywhere to turn their visionary AI projects into reality. As we prepare for growth after our Series A, our team, led by co-founders with PhDs in AI, Math, and Computer Science, is poised to redefine computing.

About the Role

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. As an aggregator of compute resources from hundreds of global suppliers, our SLOs, trust, and economic efficiency are product-critical. You'll be responsible for defining and maintaining service level objectives for job success rates, building robust incident response systems, managing capacity across our distributed GPU network, and implementing secure rollout and rollback mechanisms that keep our platform running smoothly 24/7.

In this role, you'll establish the reliability standards that define customer trust in our platform, design monitoring and alerting systems that provide deep visibility into our infrastructure, build automation for capacity management and resource allocation, lead incident response and post-mortem processes, and work closely with engineering teams to improve system resilience.

You'll also focus on security and infrastructure hardening, ensuring strong isolation between tenants and suppliers, implementing key management systems, and building compliance frameworks. This is a high-impact position where your work directly influences our ability to deliver on our promise of affordable, accessible AI compute at scale.

Qualifications
  • Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
  • Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
  • Experienced in incident response, on-call rotations, and post-mortem processes with a track record of reducing MTTR and improving system resilience
  • Deep knowledge of deployment systems including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
  • Proficient in observability tools and practices including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
  • Strong understanding of infrastructure security including tenant isolation, workload isolation, network segmentation, and security hardening
  • Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
  • Knowledge of compliance frameworks and security best practices for cloud platforms (SOC 2, ISO 27001, or similar)
  • Excellent problem-solving skills with the ability to debug complex distributed systems issues under pressure
  • Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines

Preferred Qualifications
  • Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
  • Background in distributed systems, peer-to-peer networks, or decentralized infrastructure
  • Knowledge of multi-tenancy security patterns, container security, and runtime security tools
  • Experience with chaos engineering, fault injection, and resilience testing
  • Familiarity with cost optimization strategies for cloud infrastructure and GPU resources
  • Experience building and operating systems with demanding uptime requirements (99.9%+ SLAs)
  • Background at companies like AWS, Google Cloud, Azure, or fast-growing infrastructure startups
  • Contributions to open-source reliability, observability, or security tools

Vacancy posted 13 hours ago