Infrastructure Engineer (GPU & Compute)

$180k - $200k

Lightning AI

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What You'll Do
Systems, Image & Validation Infrastructure
  • Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
  • Run and maintain test clusters used for system validation, diagnostics, and bring-up
  • Validate firmware, drivers, and OS images across compute and GPU-enabled systems
  • Support hardware qualification efforts for next-generation platforms
GPU Diagnostics & Performance
  • Own GPU diagnostics and validation workflows across large-scale infrastructure
  • Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
  • Analyze system and GPU performance using tools such as NVIDIA DCGM
  • Identify failure patterns and drive improvements in system stability and validation coverage
Automation & Tooling
  • Build and maintain automation for provisioning, validation, and system bring-up
  • Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
  • Improve the reliability, repeatability, and scalability of image pipelines and validation systems
Systems & Operations
  • Manage and operate Linux-based systems in production and validation environments
  • Manage virtualization technology
  • Support bare-metal provisioning workflows, including PXE and image-based systems
  • Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
Cross-Functional Collaboration
  • Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
  • Collaborate with platform and ML teams to ensure systems meet workload requirements
  • Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure
What You'll Need
Required Qualifications
  • 5+ years of experience in infrastructure engineering, systems engineering, or related roles
  • Strong Linux systems experience in production environments
  • Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
  • Familiarity with bare-metal provisioning and system bring-up workflows
  • Proficiency in Python or similar scripting/programming languages for automation
  • Ability to debug complex issues across hardware, OS, GPUs, and system software
Ideal Experience
  • Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
  • Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
  • Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
  • Data center operations experience, including working with physical hardware
  • Experience supporting AI/ML or HPC workloads at scale
  • Experience with GPU validation frameworks or large-scale hardware qualification processes
Compensation

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.

The anticipated annual base salary range for this role is:

$180,000 - $200,000 USD

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

  • Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

Vacancy posted 3 days ago