Senior Site Reliability Engineer - Observability
Lambda
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco, San Jose, or Bellevue, WA office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do
- Deploy and operate observability platforms for logging, metrics, and distributed tracing.
- Automate the deployment and operation of these observability systems.
- Set up monitoring for modern AI/HPC cluster infrastructure.
- Develop platform software to make observability adoptable and improve product reliability.
- Lead members of other engineering teams in development of solutions for their monitoring challenges.
You
- Have 8+ years of experience in software engineering, with 3+ years in Go
- Have 5+ years of experience in Site Reliability Engineering practices
- Possess proven understanding of observability tools and practices
- Have experience with application deployment and monitoring using Kubernetes
- Have strong experience with modern DevOps practices
- Expect quality and reliability from the solutions you build
- Enjoy collaborating across team boundaries to help our engineering teams meet their observability needs

Nice to Have
- Experience with compute infrastructure monitoring or network monitoring
- Experience with Prometheus and writing queries in PromQL
- Experience with messaging systems like NATS
- Understanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collector
- Experience with network monitoring for Ethernet and InfiniBand
- Understanding of dashboard design principles
- Strong understanding of Linux fundamentals and system administration
- Experience with infrastructure automation tooling such as Ansible and Terraform

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

A Final Note
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

