Infrastructure Engineer (Observability)

Lightning AI

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What You'll Do
Observability Platform & Productization
  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
  • Drive the productization of observability capabilities for both internal teams and external customers
  • Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
  • Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines
  • Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
  • Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
  • Implement streaming and real-time data pipelines using tools such as Kafka, OTEL, Promtail, or similar
Alerting, Reliability & Insights
  • Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
  • Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
  • Build automated insights and enable proactive detection, forecasting, and system health visibility at scale
Systems & Infrastructure Engineering
  • Contribute to broader infrastructure engineering projects beyond observability
  • Partner with infrastructure and platform teams to embed observability into core systems and workflows
  • Support large-scale, distributed systems across compute, networking, and storage environments
Cross-Functional Collaboration
  • Work closely with customer-facing teams to deliver external observability experiences
  • Collaborate with engineering, operations, and support teams to improve system transparency and reliability
  • Help define best practices for observability across the organization
What You'll Need
Required Qualifications
  • 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
  • Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
  • Experience building and operating observability platforms at scale
  • Proficiency in Python, Go, or Bash for automation and data integration
  • Familiarity with containerized environments and Kubernetes observability
  • Experience with streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
  • Experience with multi-tenant monitoring architectures
  • Strong written and verbal communication skills
Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment
Vacancy posted 1 day ago