Senior Site Reliability Engineer

Careers at Drata

# Senior Site Reliability Engineer

Hybrid - San Francisco

**Our Mission & Values:** At Drata, we help companies earn and keep the trust of their users, customers, partners, and prospects. We’re the proof layer that shows great companies deserve the trust they aim to build.

We live our values every day. Built on Trust means consistency is everything. Act with Integrity by always doing the right thing. Being Customer-Obsessed keeps the people we serve at the center of our work. Competitive Fire drives us to push ourselves harder than anyone else. Diversity brings unique perspectives that lead to better solutions. Automation First ensures we save time and money by making efficiency a priority.

**Our Culture & Work Style**

At Drata, we’re not just building software - we’re building a mindset. Everything we do springs from:

* **Be a Driver (Owner-Operator Mentality):** Own your work. Improve relentlessly. Deliver results.
* **Move at Drata Speed (Precision & Velocity):** Fast decisions. Quick learning. Immediate impact.
* **Stay Mission-Driven (Customer-Obsessed):** Challenge assumptions. Deliver value. Stay hungry.

We pair that high-velocity culture with a **thoughtful hybrid model** because we believe flexibility and collaboration both matter. That’s why in the Bay we come together **in-office Tuesday through Thursday**, our high-impact collaboration days where teams align, strategize, and innovate. Mondays and Fridays are flexible, giving you space for focused work, balance, and autonomy.

If you thrive when you’re empowered, energized, and working with smart, mission-driven people, you’ll feel at home here.

The best way to understand the Driver’s Mindset is to see it in action. We’re an award-winning, mission-driven team of **600+ people worldwide**, united by a culture that values trust, speed, and continuous growth.

* Watch our CEO, Adam Markowitz, discuss the hyper-growth journey, from $0 to $100M ARR in just four years.
* Explore our "Life at Drata" page for employee testimonials on our collaborative culture and the growth opportunities available.
* See why we are consistently recognized on Fortune's Best Workplaces lists.
* Connect with us on socials: follow us for company updates, employee stories, and career news.

**Job Summary:**

Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit SRE team where you grow your career, shape standards, and collaborate with peers - while also serving as the dedicated reliability partner for one of Drata's product engineering teams across the full lifecycle of their work.

This is a highly technical role at the intersection of software engineering and systems engineering. The best SREs at Drata are engineers first: they solve problems by building solutions, not by executing manual processes. Automation is a core value, and nowhere is that more visible than in how we approach reliability.

Our infrastructure runs on AWS across multiple accounts, defined entirely in Terraform. You'll work across a modern cloud-native stack to help Drata scale reliably for a rapidly growing customer base.

**What you’ll do:**

*Reliability Architecture for Your Product Team*

You are the reliability expert for your aligned product team. You engage early - during architecture reviews and design discussions - to surface risks before they become incidents.

* Lead Production Readiness Reviews (PRRs) before new services launch, with the authority to flag gaps and gate launches when critical reliability standards aren't met
* Partner with product engineering leads and staff engineers to define SLOs and SLIs for critical services, turning reliability from a vague goal into a measurable commitment
* Participate in team planning and architecture reviews to provide proactive reliability guidance
* Build reusable artifacts - SLO templates, observability checklists, alerting standards, reference dashboards - that raise the reliability floor across the team, not just the services you touch directly

*Eliminating Toil Through Engineering*

You handle operational needs from your product team, but your job isn't to be a help desk. Your goal is to make each request the last of its kind. When an engineer needs something, your priority is: automate it so anyone can do it → document it so the team can self-serve → execute it manually only as a last resort.

* Build and maintain Datadog monitors, dashboards, and alert routing - enforcing infrastructure-as-code standards via Terraform so those resources are owned, versioned, and auditable
* Handle infrastructure requests: ECS task management, secret rotations, Terraform changes, capacity adjustments
* Identify repeated manual work and convert it into self-service tooling or runbooks
* Audit existing services for reliability anti-patterns and surface top risks before they cause incidents

*Central SRE Platform Work*

Beyond your product team, you contribute to cross-cutting infrastructure, tooling, and standards that benefit every team at Drata. Recent examples include automated Datadog governance workflows, dynamic AWS account provisioning, and disaster recovery exercises.

* Design and build shared platform infrastructure - reusable Terraform modules, standardized observability stacks, service templates - so reliability improvements compound across the organization
* Participate in the on-call rotation and lead incident response when needed; conduct thorough post-incident reviews to drive lasting fixes
* Design and manage CI/CD pipelines using GitHub Actions
* Contribute to evolving SRE standards, tooling, and practices across the organization

**What you'll bring:**

* 6+ years of experience in Site Reliability Engineering, Cloud Engineering, or building and maintaining scalable, resilient services
* Robust knowledge of cloud computing technologies: Terraform, Docker, Git, and Linux
* Hands-on experience with Datadog for monitoring, alerting, dashboards, SLO tracking, and distributed tracing
* Experience building software systems as a software engineer
* Experience developing tooling and automation in Python and/or Bash
* Experience with CI/CD pipeline automation, specifically GitHub Actions
* Experience with disaster recovery practices and incident management
* Strong understanding of observability concepts - monitoring, logging, distributed tracing, and metrics - and how to apply them to production systems
* Experience with container orchestration and deployment technologies including AWS ECS Fargate and/or Kubernetes
* Experience working with relational databases (MySQL proficiency is a plus)
* Ability to take ownership of problems and act on them independently in a constantly evolving environment

**Nice to Have:**

* Experience with AIOps - using AI/ML-based tooling for anomaly detection, predictive alerting, or automated incident triage
* Familiarity with the reliability characteristics of AI/ML-backed services (e.g., LLM inference latency, non-determinism, prompt pipeline observability)
* Experience with the JavaScript/Node.js ecosystem
* Certified Kubernetes Administrator (CKA) certification
* Familiarity with compliance frameworks like SOC 2, ISO 27001, or NIST

**AI Experience** (required - at least one of the following):

* Hands-on experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, or similar) to accelerate automation, scripting, or infrastructure work
* Demonstrated use of AI/AIOps capabilities for reliability tasks - anomaly detection, incident triage, runbook generation, or alert noise reduction
* Familiarity with the operational characteristics of AI/ML-backed services and what it means to make them observable and reliable in production
* Demonstrated passion for AI through personal projects, contributions, or continuous learning in the context of infrastructure or reliability engineering

**How we support you:**

At Drata, our people are our strongest advantage, and we...


Vacancy posted 3 days ago

Senior Site Reliability Engineer
$166.9k - $225.9k
...Job Summary Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit SRE team... ...organization. What you’ll bring 6+ years of experience in Site Reliability Engineering, Cloud Engineering, or...
Senior
Flexible hours
Drata
San Francisco, CA
3 days ago
