Site Reliability Engineer (SRE)
$170k - $250k
Recruiting from Scratch
Location: San Francisco, CA / Palo Alto, CA
Company Stage of Funding: Growth-Stage AI Infrastructure Company ($80M Raised)
Office Type: Onsite (4 Days Per Week)
Salary: $170,000-$250,000 + Competitive Equity
Company Description
We're representing a rapidly growing AI infrastructure company building a next-generation GPU cloud platform for enterprises, startups, and AI researchers. Their platform provides flexible access to GPU compute through intelligent reservation, marketplace, and consumption models that help customers optimize performance, availability, and cost. Backed by Sequoia Capital and Lightspeed with more than $80 million in funding, the company has achieved 6x revenue growth over the past year. As demand for AI infrastructure accelerates, they're investing heavily in reliability engineering to build the automation, observability, and platform infrastructure that powers their multi-cloud GPU marketplace at scale.
What You Will Do
- Design, build, and own the observability platform supporting a large-scale, multi-cloud GPU infrastructure.
- Develop monitoring, distributed tracing, dashboards, and alerting systems using modern observability tooling.
- Define and implement SLIs, SLOs, and operational metrics across customer-facing APIs and internal platform services.
- Build automation that eliminates repetitive operational work and improves platform reliability.
- Develop production tooling in Python or Go for infrastructure management, health checks, reconciliation, and capacity optimization.
- Design and maintain Infrastructure-as-Code using Terraform, Pulumi, and Kubernetes.
- Improve platform resiliency through incident response, root cause analysis, and long-term reliability improvements.
- Partner closely with Platform, Product, and Engineering teams to ensure new services are designed for operational excellence.
- Help establish infrastructure engineering standards, reliability practices, and operational processes as the company scales.
- Participate in production on-call rotations while continuously reducing operational burden through automation.
Requirements
- 3-10 years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or Platform Engineering.
- Strong experience building production automation and operational tooling rather than solely responding to incidents.
- Proven experience designing and operating large-scale Kubernetes environments.
- Strong cloud infrastructure experience across AWS, GCP, Azure, or multi-cloud environments.
- Experience designing distributed systems with a strong understanding of networking fundamentals.
- Proficiency with Python and/or Go for building production-grade infrastructure tooling.
- Experience implementing observability platforms using Prometheus, Grafana, OpenTelemetry, or similar technologies.
- Strong understanding of Linux systems, containers, Docker, and production operations.
- Excellent communication skills with the ability to collaborate across engineering teams.
Preferred Qualifications
- Experience supporting AI infrastructure, GPU clusters, machine learning platforms, or accelerated compute environments.
- Familiarity with Terraform, Pulumi, Infrastructure-as-Code, and cloud automation.
- Experience designing reliability standards, operational playbooks, and incident management processes.
- Background at high-growth startups or major cloud infrastructure organizations.
- Strong understanding of distributed systems, capacity planning, and performance optimization.
- Experience building greenfield infrastructure rather than maintaining legacy systems.
- Passion for automation, reducing operational toil, and continuously improving developer experience.
- Ability to thrive in fast-paced startup environments with significant ownership and autonomy.
Compensation and Benefits
- Base salary: $170,000-$250,000.
- Competitive equity package.
- Visa transfer sponsorship available.
- Four-day onsite schedule across San Francisco and Palo Alto offices (all engineers collaborate in Palo Alto on Mondays).
Why Join
- Opportunity to help define the reliability and operational foundation of one of the fastest-growing AI infrastructure platforms.
- Significant ownership over observability, automation, and production infrastructure.
- Work alongside experienced engineers solving large-scale distributed systems and cloud infrastructure challenges.
- Join a high-growth, venture-backed company building the infrastructure powering the next generation of AI applications.
Vacancy posted 5 days ago


