Infrastructure Engineer (GPU & Compute)
Lightning AI
$180k - $200k
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.
Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.
We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
What You'll Do
Systems, Image & Validation Infrastructure
- Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
- Run and maintain test clusters used for system validation, diagnostics, and bring-up
- Validate firmware, drivers, and OS images across compute and GPU-enabled systems
- Support hardware qualification efforts for next-generation platforms
GPU Diagnostics & Performance
- Own GPU diagnostics and validation workflows across large-scale infrastructure
- Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
- Analyze system and GPU performance using tools such as NVIDIA DCGM
- Identify failure patterns and drive improvements in system stability and validation coverage
Automation & Tooling
- Build and maintain automation for provisioning, validation, and system bring-up
- Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
- Improve the reliability, repeatability, and scalability of image pipelines and validation systems
Systems & Operations
- Manage and operate Linux-based systems in production and validation environments
- Manage virtualization technology
- Support bare-metal provisioning workflows, including PXE and image-based systems
- Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
Cross-Functional Collaboration
- Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
- Collaborate with platform and ML teams to ensure systems meet workload requirements
- Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure
What You'll Need
Required Qualifications
- 5+ years of experience in infrastructure engineering, systems engineering, or related roles
- Strong Linux systems experience in production environments
- Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
- Familiarity with bare-metal provisioning and system bring-up workflows
- Proficiency in Python or similar scripting/programming languages for automation
- Ability to debug complex issues across hardware, OS, GPUs, and system software
Ideal Experience
- Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
- Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
- Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
- Data center operations experience, including working with physical hardware
- Experience supporting AI/ML or HPC workloads at scale
- Experience with GPU validation frameworks or large-scale hardware qualification processes
Compensation
We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.
The anticipated annual base salary range for this role is:
$180,000 - $200,000 USD
Benefits and Perks
We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.
Benefits include:
- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment
At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

