Site Reliability Engineer (SRE)

$170k - $250k

Recruiting from Scratch

Location: San Francisco, CA / Palo Alto, CA
Company Stage of Funding: Growth-Stage AI Infrastructure Company ($80M Raised)
Office Type: Onsite (4 Days Per Week)
Salary: $170,000-$250,000 + Competitive Equity
Company Description

We're representing a rapidly growing AI infrastructure company building a next-generation GPU cloud platform for enterprises, startups, and AI researchers. Their platform provides flexible access to GPU compute through intelligent reservation, marketplace, and consumption models that help customers optimize performance, availability, and cost.

Backed by Sequoia Capital and Lightspeed with more than $80 million in funding, the company has achieved 6x revenue growth over the past year. As demand for AI infrastructure accelerates, they're investing heavily in reliability engineering to build the automation, observability, and platform infrastructure that powers their multi-cloud GPU marketplace at scale.
What You Will Do
  • Design, build, and own the observability platform supporting a large-scale, multi-cloud GPU infrastructure.
  • Develop monitoring, distributed tracing, dashboards, and alerting systems using modern observability tooling.
  • Define and implement SLIs, SLOs, and operational metrics across customer-facing APIs and internal platform services.
  • Build automation that eliminates repetitive operational work and improves platform reliability.
  • Develop production tooling in Python or Go for infrastructure management, health checks, reconciliation, and capacity optimization.
  • Design and maintain Infrastructure-as-Code using Terraform, Pulumi, and Kubernetes.
  • Improve platform resiliency through incident response, root cause analysis, and long-term reliability improvements.
  • Partner closely with Platform, Product, and Engineering teams to ensure new services are designed for operational excellence.
  • Help establish infrastructure engineering standards, reliability practices, and operational processes as the company scales.
  • Participate in production on-call rotations while continuously reducing operational burden through automation.
Ideal Background
  • 3-10 years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or Platform Engineering.
  • Strong experience building production automation and operational tooling rather than solely responding to incidents.
  • Proven experience designing and operating large-scale Kubernetes environments.
  • Strong cloud infrastructure experience across AWS, GCP, Azure, or multi-cloud environments.
  • Experience designing distributed systems with a strong understanding of networking fundamentals.
  • Proficiency with Python and/or Go for building production-grade infrastructure tooling.
  • Experience implementing observability platforms using Prometheus, Grafana, OpenTelemetry, or similar technologies.
  • Strong understanding of Linux systems, containers, Docker, and production operations.
  • Excellent communication skills with the ability to collaborate across engineering teams.
Preferred
  • Experience supporting AI infrastructure, GPU clusters, machine learning platforms, or accelerated compute environments.
  • Familiarity with Infrastructure-as-Code tools such as Terraform and Pulumi, and with cloud automation.
  • Experience designing reliability standards, operational playbooks, and incident management processes.
  • Background at high-growth startups or major cloud infrastructure organizations.
  • Strong understanding of distributed systems, capacity planning, and performance optimization.
  • Experience building greenfield infrastructure rather than maintaining legacy systems.
  • Passion for automation, reducing operational toil, and continuously improving developer experience.
  • Ability to thrive in fast-paced startup environments with significant ownership and autonomy.
Compensation and Benefits
  • Base salary: $170,000-$250,000.
  • Competitive equity package.
  • Visa transfer sponsorship available.
  • Four-day onsite schedule across San Francisco and Palo Alto offices (all engineers collaborate in Palo Alto on Mondays).
  • Opportunity to help define the reliability and operational foundation of one of the fastest-growing AI infrastructure platforms.
  • Significant ownership over observability, automation, and production infrastructure.
  • Work alongside experienced engineers solving large-scale distributed systems and cloud infrastructure challenges.
  • Join a high-growth, venture-backed company building the infrastructure powering the next generation of AI applications.
Vacancy posted 5 days ago