Senior Site Reliability Engineer - Observability

Lambda

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco, San Jose, or Bellevue, WA office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do
  • Deploy and operate observability platforms for logging, metrics, and distributed tracing.
  • Automate the deployment and operation of these observability systems.
  • Set up monitoring for modern AI/HPC cluster infrastructure.
  • Develop platform software to make observability adoptable and improve product reliability.
  • Lead members of other engineering teams in development of solutions for their monitoring challenges.

You
  • Have 8+ years of experience in software engineering, with 3+ years in Go
  • Have 5+ years of experience in Site Reliability Engineering practices
  • Possess proven understanding of Observability tools and practices
  • Have experience with application deployment and monitoring using Kubernetes
  • Have strong experience with modern DevOps practices
  • Expect quality and reliability from the solutions you build
  • Enjoy collaborating across team boundaries to help our engineering teams meet their observability needs

Nice to Have
  • Experience with compute infrastructure monitoring or network monitoring
  • Experience with Prometheus and writing queries in PromQL
  • Experience with messaging systems like NATS
  • Understanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collector
  • Experience with network monitoring for Ethernet and InfiniBand
  • Understanding of dashboard design principles
  • Strong understanding of Linux fundamentals and system administration
  • Experience with infrastructure automation tooling such as Ansible and Terraform

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

A Final Note
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the Senior Site Reliability Engineer - Observability in San Francisco, CA vacancy
  • Fieldguide is seeking a Senior Site Reliability Engineer to ensure the reliability and scalability of our production systems in San Francisco,...  ...teams to define reliability standards and build robust observability practices. Candidates should have at least 5 years of experience... 
    Senior
    Remote job
    Flexible hours

    Fieldguide

    San Francisco, CA
    5 days ago
  •  ...The Team Platform Engineering is the department within SRE that is responsible for a range...  ...edge and internal service mesh), and observability and alerting systems. The Fleet Management...  ...components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager... 
    Senior
    Work at office
    Local area
    Remote work
    Worldwide
    Flexible hours

    MongoDB

    San Francisco, CA
    1 day ago
  • $140k - $205k

     ...Senior Technology Site Reliability Engineer Cooley is seeking a Senior Site Reliability Engineer to join the Infrastructure & Development Operations team...  ...to build and maintain automated, resilient, and observable systems that support high availability and operational... 
    Senior
    Full time
    Temporary work
    Work at office
    Flexible hours
    Weekend work

    Cooley

    San Francisco, CA
    1 day ago
  • US Corp. is seeking a Lead Site Reliability Engineer to spearhead our mission of delivering highly available and performant systems. With an...  ...identifying bottlenecks, and implementing robust monitoring and observability solutions using Prometheus and Grafana. As a technical... 
    Senior

    Axiom Pursuits

    San Francisco, CA
    1 day ago
  • $210.6k - $305.1k

     ...Networking, Security, Collaboration, and Observability portfolios Your Impact As part...  ...~ You have led a distributed team of 5+ engineers, can demonstrate strong technical vision...  ...insurance. Please see the Cisco careers site to discover more benefits and perks. Employees... 
    Senior
    Full time
    Temporary work
    Local area
    Flexible hours

    Cisco

    San Francisco, CA
    1 day ago
  • $227.2k - $324.5k

     ...About the Role: Site Reliability Engineering (SRE) at Tubi is not a traditional operations team...  ...seeking an experienced and visionary Senior SRE Manager to lead and grow our newly...  ...strategy and vision for Tubi's observability, and automation platforms. Partner with... 
    Senior
    Full time
    Contract work
    Temporary work
    Local area
    Flexible hours

    Tubi

    San Francisco, CA
    1 day ago
  •  ...poised to redefine computing. About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI...  ...flags, and automated rollback mechanisms Proficient in observability tools and practices including metrics, logging, tracing,... 
    Senior

    deCircle

    San Francisco, CA
    1 day ago
  •  ...and onboard services and teams to the reliability tenets. Establish and maintain...  ...development teams to build resilient, observable, fault‑tolerant, recoverable, and scalable...  .... 6+ years of experience in Site Reliability Engineering, managing infrastructure and services... 
    Senior

    OutSystems, Inc.

    San Francisco, CA
    2 days ago
  • What you’ll do As a Senior Site Reliability Engineer, you’ll work closely with product teams in Spend to deliver and maintain scalable, reliable...  ..., and operational readiness. Lead incident response, observability, and automation across critical systems. Own team-level... 
    Senior

    Airwallex

    San Francisco, CA
    1 day ago
  •  ...that possible. We’re a team of doctors, engineers, designers, researchers, and creatives...  ...end-to-end. Improve operational reliability: Identify recurring issues and reliability...  ...as familiarity increases. Strengthen observability: Improve dashboards, alerts, logs, and... 
    Senior
    Work at office
    Worldwide

    Heidi Health Ltd

    San Francisco, CA
    2 days ago
  •  ...was a machine learning research engineer at Scale AI. The rest of our team...  ...with state-of-the-art AI. As a Senior SRE, you'll tackle the scaling and reliability challenges that come with adding...  ...and building the automation and observability that keep Unify fast and reliable... 
    Senior

    Unify

    San Francisco, CA
    2 days ago
  •  ...product, you will find a home at Fieldguide. About the Role As a Senior Site Reliability Engineer (SRE) at Fieldguide, you will be responsible for ensuring the reliability, scalability, and observability of our production systems. You will apply software engineering... 
    Senior
    Remote work
    Work from home
    Flexible hours

    Fieldguide

    San Francisco, CA
    5 days ago
  • $230k - $310k

    A tech company is seeking an experienced Site Reliability Engineer to ensure the reliability and performance of its production systems across AWS infrastructure. You will build observability tools, lead incident responses, and collaborate on architectural improvements.... 

    Gamma

    San Francisco, CA
    3 days ago
  • $60 per hour

    Senior Site Reliability Engineer (Copy) Seattle Hybrid (Hybrid location). Full-time. About Us Supio is a trusted AI platform purpose-built for law...  ...and hotfix coordination. Build safe, repeatable, and observable workflows. GitHub Operations: Manage GitHub branching strategies... 
    Senior
    Full time
    Work at office
    Flexible hours

    Bonfirevc

    San Francisco, CA
    2 days ago
  • Senior Site Reliability Engineer Hybrid - San Francisco Our Mission & Values: At Drata, we help...  ...SRE team operates as both a central engineering function and an embedded reliability...  ...reusable artifacts - SLO templates, observability checklists, alerting standards,... 
    Senior
    Work at office
    Immediate start
    Worldwide
    Monday to Friday
    Flexible hours

    Careers at Drata

    San Francisco, CA
    3 days ago
  • $325k

    Engineering at Ivo Engineers At Ivo Are Inventors. Ivo Was First-to-market With...  ...hit our SLAs. We’re looking for a Senior or Staff level Site Reliability Engineer as part of the...  ...slow the product to a crawl Build observability that answers: what, why and how often... 
    Senior
    Contract work

    Icehouseventures

    San Francisco, CA
    1 day ago
  • $166.9k - $225.9k

     ...team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close...  ...artifacts—SLO templates, observability checklists, alerting standards, reference...  ...bring 6+ years of experience in Site Reliability Engineering, Cloud... 
    Senior
    Flexible hours

    Drata

    San Francisco, CA
    3 days ago
  • Somi AI in San Francisco is looking for a Software Engineer to join our Insights team. You will design and implement solutions that enhance database observability across our systems, collaborating with various teams to ensure performance metrics are effectively reported... 
    Senior

    Somi AI

    San Francisco, CA
    4 days ago
  • Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco · Full-Time About Andromeda Andromeda Cluster was...  ...fabric-level issues that degrade collective operations. Observability: Build deep visibility into GPU utilization, memory... 
    Senior
    Full time
    Remote work

    Andromeda

    San Francisco, CA
    2 days ago
  • $181k - $263k

     ...first line operational support. We are looking for a Senior Staff Site Reliability Engineer who will set the technical direction for reliability engineering...  ...internal tooling adopted across teams Expertise in observability engineering: SLOs, SLI pipelines, and high-signal... 
    Senior
    Work from home
    Flexible hours
    Night shift

    Liveramp

    San Francisco, CA
    5 days ago
  • $127k - $249k

    We are looking for an experienced Senior or Staff Engineer for our SRE, InfraSec team, to guide the security of our cloud-based infrastructure...  ...‑focused areas, such as runtime scanning, security observability, CSPM, and more Cloud Expertise: Strong experience with... 
    Senior
    Local area
    Remote work
    Flexible hours

    Insider, Inc.

    San Francisco, CA
    2 days ago
  • $232k - $319k

     ...scale the service with great people and reliable, cost-effective, and efficient...  ...Edge networking, K8s platform, CI/CD, Observability, automation platform & tooling....  ...partnership with architects and product engineering Build a world-class observability platform... 
    Senior
    Permanent employment
    Local area
    Worldwide
    Flexible hours

    Okta, Inc.

    San Francisco, CA
    2 days ago
  • $175k - $225k

     ...Senior Backend Engineer In person 5 days/week in San Francisco, Boston, MA, New York. We...  ...backend systems that power LangChain's observability and evals platform. You will work on...  ...evaluation data. Ensure system reliability through strong testing, monitoring,... 
    Senior
    Work at office
    Flexible hours

    LangChain

    San Francisco, CA
    5 days ago
  • $190k - $290k

     ...Adyen, everything we do is engineered for ambition. For our...  .... Customer Developer Observability Team We believe that our...  ...being able to shift to highly reliable systems Building and maintaining...  ...Currently working as a Senior Software Engineer or at a... 
    Senior
    H1b
    Work at office
    Visa sponsorship
    Flexible hours
    Shift work

    Adyen

    San Francisco, CA
    1 day ago
  •  ...systems. As a Staff Platform Engineer, you will play a critical...  ...leadership role. You will own reliability for major platform domains,...  ...Establish and enhance centralized Observability and Monitoring platforms and...  ..., Platform Engineering, or Site Reliability Engineering role... 
    Senior

    Saviynt

    San Francisco, CA
    2 days ago
  • $175k - $240k

     ...intelligent agents ubiquitous. We build the foundation for agent engineering in the real world, helping developers move from prototypes to...  ...the real world. Today, our platform includes LangSmith (Observability, Evaluation, Deployment, Fleet, and Sandboxes), our open... 
    Senior
    Work at office
    Flexible hours

    LangChain, Inc

    San Francisco, CA
    5 days ago
  • $155k - $195k

     ...across their organization. Founded in 2023, LangChain powers top engineering teams at companies like Replit, Lovable, Clay, Klarna,...  ...working on our enterprise platform product for LLM application observability, testing, and debugging. You will: Develop new user-facing features... 
    Senior

    LangChain

    San Francisco, CA
    5 days ago
  • $170k - $195k

     ...ubiquitous. We provide the agent engineering platform and open source...  ...developers need to ship reliable agents fast. Our open source...  ...granular control. LangSmith offers observability, evaluation, and deployment...  .... We are looking for a Senior Backend Engineer to join us.... 
    Senior
    Worldwide
    Flexible hours

    LangChain

    San Francisco, CA
    1 day ago
  • $160k - $270k

     ...in container orchestration. Responsibilities include establishing security controls, improving developer experience, and owning observability processes. The position offers a competitive salary ranging from $160,000 - $270,000 along with excellent benefits including health... 
    Senior

    Mandolin

    San Francisco, CA
    5 days ago
  • $175k - $250k

    I did my part and supported the Regular Toilet is seeking a Site Reliability Engineer to enhance the reliability and performance of our systems at WorkOS. As a key member of the SRE team, you will handle critical responsibilities like improving incident responses and collaborating... 
    Remote job
    Flexible hours

    I did my part and supported the Regular Toilet

    San Francisco, CA
    2 days ago
