Staff Site Reliability Engineer - Observability GCP

$194k - $267k

Okta

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

We are seeking a highly technical Observability Site Reliability Engineer with a specialty in Google Cloud to own and expand our Observability ecosystem into GCP. In this role, you will move beyond simple monitoring to deliver a world-class, comprehensive, scalable Observability Platform that enables our SRE teams and business partners. You will treat infrastructure as code, using Terraform and strong coding proficiency in Go, Python, or Ruby to automate the deployment of agents and collectors across complex distributed systems.

Key Responsibilities

  • Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
  • GCP Observability Engineering: Optimize the collection, processing, and storage of Observability data to ensure high reliability and low latency of our Splunk and Grafana services.
  • Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
  • Automation: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.

Required Skills & Experience (The Essentials)

  • GKE: Minimum 5+ years of experience scaling and managing observability on Google Cloud Platform.
  • Visualization: Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources.
  • SRE Mindset: Minimum 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
  • Programming Proficiency: Strong coding skills in Python or Go for building internal tools and automating workflows.
  • Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/GKE).
  • Problem Solving: A data-driven approach to debugging complex, cross-service performance bottlenecks.

Bonus Skills (The "Nice-to-Haves")

  • Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
  • Grafana Loki: Experience migrating from Splunk to Grafana Loki.
  • Other Cloud Platforms: Experience managing native observability tools within AWS.

Additional requirements:

  • This position requires the ability to access federal environments and/or have access to protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; 22 CFR 120.15) upon hire.


P24517_3387022

Below is the annual base salary range for candidates located in the San Francisco Bay Area. Your actual base salary will depend on factors such as your skills, qualifications, experience, and work location. In addition, Okta offers equity (where applicable), bonus, and benefits, including health, dental and vision insurance, 401(k), flexible spending account, and paid leave (including PTO and parental leave) in accordance with our applicable plans and policies.

The annual base salary range for this position for candidates located in the San Francisco Bay Area is $194,000 to $267,000 USD.

The Okta Experience

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.

If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.

Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.
Vacancy posted 16 days ago