Infrastructure Engineer (Observability)
$180k - $200k | Lightning AI
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.
Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.
We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
What We're Looking For
Lightning AI is seeking an Observability Infrastructure Engineer to join our Infrastructure Engineering team.
In this role, you will own and evolve observability systems across large-scale, GPU-enabled bare-metal infrastructure. You'll operate at the intersection of infrastructure, data, and product, building platforms for metrics, logs, traces, and alerting that power both internal operations and customer-facing visibility.
You will play a key role in productizing observability, enabling scalable, multi-tenant monitoring experiences while keeping pace with rapid infrastructure buildouts. This includes designing telemetry pipelines, improving signal quality, and delivering actionable insights that ensure reliability and transparency across our platform.
We're flexible on location for this team. This role can be hybrid, based out of one of our U.S. hubs (Seattle, NYC, or SF), or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.
What You'll Do
Observability Platform & Productization
- Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
- Drive the productization of observability capabilities for both internal teams and external customers
- Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
- Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines
- Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
- Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
- Implement streaming and real-time data pipelines using tools such as Kafka, OTEL, Promtail, or similar
Alerting, Reliability & Insights
- Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
- Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
- Build automated insights and enable proactive detection, forecasting, and system health visibility at scale
Systems & Infrastructure Engineering
- Contribute to broader infrastructure engineering projects beyond observability
- Partner with infrastructure and platform teams to embed observability into core systems and workflows
- Support large-scale, distributed systems across compute, networking, and storage environments
Cross-Functional Collaboration
- Work closely with customer-facing teams to deliver external observability experiences
- Collaborate with engineering, operations, and support teams to improve system transparency and reliability
- Help define best practices for observability across the organization
What You'll Need
Required Qualifications
- 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
- Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
- Experience building and operating observability platforms at scale
- Proficiency in Python, Go, or Bash for automation and data integration
- Familiarity with containerized environments and Kubernetes observability
- Experience with streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
- Experience with multi-tenant monitoring architectures
- Strong written and verbal communication skills
Ideal Experience
- Experience with GPU observability, particularly NVIDIA DCGM
- Experience monitoring large-scale GPU or HPC clusters
- Familiarity with InfiniBand fabric observability
- Experience building customer-facing or productized infrastructure systems
- Experience with correlation engines, RCA workflows, or predictive alerting systems
- Broad exposure to infrastructure domains including networking, storage, and provisioning
Compensation
We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.
The anticipated annual base salary range for this role is:
$180,000 - $200,000 USD
Benefits and Perks
We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.
Benefits include:
- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment
At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.




