Staff Site Reliability Engineer - Observability GCP

$194k - $267k

Okta

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

We are seeking a highly technical Observability Site Reliability Engineer with a specialty in Google Cloud to own and expand our Observability ecosystem into GCP. In this role, you will move beyond simple monitoring to deliver a world-class, comprehensive, scalable Observability Platform that enables our SRE teams and business partners. You will treat infrastructure as code, using Terraform and strong coding proficiency in Go, Python, or Ruby to automate the deployment of agents and collectors across complex distributed systems.

Key Responsibilities

Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
GCP Observability Engineering: Optimize the collection, processing, and storage of observability data to ensure high reliability and low latency of our Splunk and Grafana services.
Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
Automation: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.

Required Skills & Experience (The Essentials)

GKE: Minimum 5+ years of experience scaling and managing observability on Google Cloud Platform.
Visualization: Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources.
SRE Mindset: Minimum 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.

Programming Proficiency: Strong coding skills in Python or Go for building internal tools and automating workflows.
Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/GKE).
Problem Solving: A data-driven approach to debugging complex, cross-service performance bottlenecks.

Bonus Skills (The "Nice-to-Haves")

Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
Grafana Loki: Experience migrating from Splunk to Grafana Loki.

Other Cloud Platforms: Experience managing native observability tools within AWS.

Additional requirements:

This position requires the ability to access federal environments and/or protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.

Below is the annual base salary range for candidates located in San Francisco Bay Area. Your actual base salary will depend on factors such as your skills, qualifications, experience, and work location. In addition, Okta offers equity (where applicable), bonus, and benefits, including health, dental and vision insurance, 401(k), flexible spending account, and paid leave (including PTO and parental leave) in accordance with our applicable plans and policies. To learn more about our Total Rewards program please visit: .

The annual base salary range for this position for candidates located in the San Francisco Bay Area is $194,000 to $267,000 USD.

The Okta Experience

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.

If a reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.

Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.

Vacancy posted a month ago
