Staff Site Reliability Engineer - Observability GCP
$194k - $267k
Okta
Secure Every Identity, from AI to Human
Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.
We are seeking a highly technical Observability Site Reliability Engineer with a specialty in Google Cloud to own and expand our Observability ecosystem into GCP. In this role, you will move beyond simple monitoring to deliver a world-class, comprehensive, scalable Observability Platform that enables our SRE teams and business partners. You will treat infrastructure as code, using Terraform and strong coding proficiency in Go, Python, or Ruby to automate the deployment of agents and collectors across complex distributed systems.
Key Responsibilities
- Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
- GCP Observability Engineering: Optimize the collection, processing, and storage of Observability data to ensure high reliability and low latency of our Splunk and Grafana services.
- Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
- Automation: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.
Required Skills & Experience (The Essentials)
- GKE: Minimum 5+ years of experience scaling and managing observability on Google Cloud Platform.
- Visualization: Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources.
- SRE Mindset: Minimum 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
- Programming Proficiency: Strong coding skills in Python or Go for building internal tools and automating workflows.
- Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/GKE).
- Problem Solving: A data-driven approach to debugging complex, cross-service performance bottlenecks.
Bonus Skills (The "Nice-to-Haves")
- Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
- Grafana Loki: Experience migrating from Splunk to Grafana Loki.
- Other Cloud Platforms: Experience managing native observability tools within AWS.
Additional requirements:
- This position requires the ability to access federal environments and/or have access to protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g. a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee. 22 CFR 120.15) upon hire.
P24517_3387022
Below is the annual base salary range for candidates located in San Francisco Bay Area. Your actual base salary will depend on factors such as your skills, qualifications, experience, and work location. In addition, Okta offers equity (where applicable), bonus, and benefits, including health, dental and vision insurance, 401(k), flexible spending account, and paid leave (including PTO and parental leave) in accordance with our applicable plans and policies.
The annual base salary range for this position for candidates located in the San Francisco Bay Area is $194,000 to $267,000 USD.
The Okta Experience
We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.
Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws. If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.
Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.
