Infrastructure Engineer (Observability)

$180k - $200k

Lightning AI

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What We're Looking For

Lightning AI is seeking an Observability Infrastructure Engineer to join our Infrastructure Engineering team.

In this role, you will own and evolve observability systems across large-scale, GPU-enabled bare-metal infrastructure. You'll operate at the intersection of infrastructure, data, and product, building platforms for metrics, logs, traces, and alerting that power both internal operations and customer-facing visibility.

You will play a key role in productizing observability, enabling scalable, multi-tenant monitoring experiences while keeping pace with rapid infrastructure buildouts. This includes designing telemetry pipelines, improving signal quality, and delivering actionable insights that ensure reliability and transparency across our platform.

We're flexible on location for this team. This role can be hybrid, based out of one of our U.S. hubs (Seattle, NYC, or SF), or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.

What You'll Do
Observability Platform & Productization
  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
  • Drive the productization of observability capabilities for both internal teams and external customers
  • Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
  • Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines
  • Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
  • Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
  • Implement streaming and real-time data pipelines using tools such as Kafka, OTEL, Promtail, or similar
Alerting, Reliability & Insights
  • Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
  • Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
  • Build automated insights and enable proactive detection, forecasting, and system health visibility at scale
Systems & Infrastructure Engineering
  • Contribute to broader infrastructure engineering projects beyond observability
  • Partner with infrastructure and platform teams to embed observability into core systems and workflows
  • Support large-scale, distributed systems across compute, networking, and storage environments
Cross-Functional Collaboration
  • Work closely with customer-facing teams to deliver external observability experiences
  • Collaborate with engineering, operations, and support teams to improve system transparency and reliability
  • Help define best practices for observability across the organization
What You'll Need
Required Qualifications
  • 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
  • Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
  • Experience building and operating observability platforms at scale
  • Proficiency in Python, Go, or Bash for automation and data integration
  • Familiarity with containerized environments and Kubernetes observability
  • Experience with streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
  • Experience with multi-tenant monitoring architectures
  • Strong written and verbal communication skills
Ideal Experience
  • Experience with GPU observability, particularly NVIDIA DCGM
  • Experience monitoring large-scale GPU or HPC clusters
  • Familiarity with InfiniBand fabric observability
  • Experience building customer-facing or productized infrastructure systems
  • Experience with correlation engines, RCA workflows, or predictive alerting systems
  • Broad exposure to infrastructure domains including networking, storage, and provisioning
Compensation

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.

The anticipated annual base salary range for this role is:

$180,000 - $200,000 USD

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

Vacancy posted 6 days ago