Infrastructure Engineer (Observability)
Lightning AI
$180k - $200k
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.
Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.
We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
What We're Looking For
Lightning AI is seeking an Observability Infrastructure Engineer to join our Infrastructure Engineering team.
In this role, you will own and evolve observability systems across large-scale, GPU-enabled bare-metal infrastructure. You'll operate at the intersection of infrastructure, data, and product, building platforms for metrics, logs, traces, and alerting that power both internal operations and customer-facing visibility.
You will play a key role in productizing observability, enabling scalable, multi-tenant monitoring experiences while keeping pace with rapid infrastructure buildouts. This includes designing telemetry pipelines, improving signal quality, and delivering actionable insights that ensure reliability and transparency across our platform.
We're flexible on location for this team. This role can be hybrid, based out of one of our U.S. hubs (Seattle, NYC, or SF), or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.
What You'll Do
Observability Platform & Productization
- Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
- Drive the productization of observability capabilities for both internal teams and external customers
- Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
- Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines
- Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
- Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
- Implement streaming and real-time data pipelines using tools such as Kafka, OpenTelemetry, Promtail, or similar
Alerting, Reliability & Insights
- Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
- Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
- Build automated insights and enable proactive detection, forecasting, and system health visibility at scale
Systems & Infrastructure Engineering
- Contribute to broader infrastructure engineering projects beyond observability
- Partner with infrastructure and platform teams to embed observability into core systems and workflows
- Support large-scale, distributed systems across compute, networking, and storage environments
Cross-Functional Collaboration
- Work closely with customer-facing teams to deliver external observability experiences
- Collaborate with engineering, operations, and support teams to improve system transparency and reliability
- Help define best practices for observability across the organization
What You'll Need
Required Qualifications
- 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
- Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
- Experience building and operating observability platforms at scale
- Proficiency in Python, Go, or Bash for automation and data integration
- Familiarity with containerized environments and Kubernetes observability
- Experience with streaming telemetry pipelines (Kafka, OpenTelemetry, Promtail, or equivalent)
- Experience with multi-tenant monitoring architectures
- Strong written and verbal communication skills
Ideal Experience
- Experience with GPU observability, particularly NVIDIA DCGM
- Experience monitoring large-scale GPU or HPC clusters
- Familiarity with InfiniBand fabric observability
- Experience building customer-facing or productized infrastructure systems
- Experience with correlation engines, RCA workflows, or predictive alerting systems
- Broad exposure to infrastructure domains including networking, storage, and provisioning
Compensation
We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.
The anticipated annual base salary range for this role is:
$180,000 - $200,000 USD
Benefits and Perks
We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.
Benefits include:
- Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment
At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

