Site Reliability Engineer (SRE)

$170k - $250k

Recruiting from Scratch

Location: San Francisco, CA / Palo Alto, CA
Company Stage of Funding: Growth-Stage AI Infrastructure Company ($80M Raised)
Office Type: Onsite (4 Days Per Week)
Salary: $170,000-$250,000 + Competitive Equity

Company Description

We're representing a rapidly growing AI infrastructure company building a next-generation GPU cloud platform for enterprises, startups, and AI researchers. Their platform provides flexible access to GPU compute through intelligent reservation, marketplace, and consumption models that help customers optimize performance, availability, and cost.

Backed by Sequoia Capital and Lightspeed with more than $80 million in funding, the company has achieved 6x revenue growth over the past year. As demand for AI infrastructure accelerates, they're investing heavily in reliability engineering to build the automation, observability, and platform infrastructure that powers their multi-cloud GPU marketplace at scale.

What You Will Do

Design, build, and own the observability platform supporting a large-scale, multi-cloud GPU infrastructure.
Develop monitoring, distributed tracing, dashboards, and alerting systems using modern observability tooling.
Define and implement SLIs, SLOs, and operational metrics across customer-facing APIs and internal platform services.
Build automation that eliminates repetitive operational work and improves platform reliability.
Develop production tooling in Python or Go for infrastructure management, health checks, reconciliation, and capacity optimization.
Design and maintain Infrastructure-as-Code using Terraform, Pulumi, and Kubernetes.
Improve platform resiliency through incident response, root cause analysis, and long-term reliability improvements.
Partner closely with Platform, Product, and Engineering teams to ensure new services are designed for operational excellence.
Help establish infrastructure engineering standards, reliability practices, and operational processes as the company scales.
Participate in production on-call rotations while continuously reducing operational burden through automation.

Ideal Background

3-10 years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or Platform Engineering.
Strong experience building production automation and operational tooling rather than solely responding to incidents.
Proven experience designing and operating large-scale Kubernetes environments.
Strong cloud infrastructure experience across AWS, GCP, Azure, or multi-cloud environments.
Experience designing distributed systems with a strong understanding of networking fundamentals.
Proficiency with Python and/or Go for building production-grade infrastructure tooling.
Experience implementing observability platforms using Prometheus, Grafana, OpenTelemetry, or similar technologies.
Strong understanding of Linux systems, containers, Docker, and production operations.
Excellent communication skills with the ability to collaborate across engineering teams.

Preferred

Experience supporting AI infrastructure, GPU clusters, machine learning platforms, or accelerated compute environments.
Familiarity with Infrastructure-as-Code tools such as Terraform and Pulumi, and with cloud automation.
Experience designing reliability standards, operational playbooks, and incident management processes.
Background at high-growth startups or major cloud infrastructure organizations.
Strong understanding of distributed systems, capacity planning, and performance optimization.
Experience building greenfield infrastructure rather than maintaining legacy systems.
Passion for automation, reducing operational toil, and continuously improving developer experience.
Ability to thrive in fast-paced startup environments with significant ownership and autonomy.

Compensation and Benefits

Base salary: $170,000-$250,000.
Competitive equity package.
Visa transfer sponsorship available.
Four-day onsite schedule across San Francisco and Palo Alto offices (all engineers collaborate in Palo Alto on Mondays).
Opportunity to help define the reliability and operational foundation of one of the fastest-growing AI infrastructure platforms.
Significant ownership over observability, automation, and production infrastructure.
Work alongside experienced engineers solving large-scale distributed systems and cloud infrastructure challenges.
Join a high-growth, venture-backed company building the infrastructure powering the next generation of AI applications.

Vacancy posted 3 days ago

