
Lead Infrastructure and Reliability Engineer (Systems & Scale)

$230k - $360k

Luma AI

About Luma AI

A new class of intelligence is emerging: systems that understand and generate the world across video, images, audio, and language.

Building multimodal AGI is not just a modeling challenge. It is an infrastructure challenge at the edge of what hardware, software, and organizations can support.

At Luma, we operate rapidly scaling 10k+ GPU fleets, pushing utilization, throughput, and reliability hard enough that yesterday's solutions break regularly. Researchers depend on this infrastructure to move the frontier forward. Customers depend on it to power real creative work.

Many companies run accelerators. Very few sit directly next to the teams inventing the models that redefine what those accelerators must do.

At Luma, improvements to scheduling, efficiency, and reliability immediately translate into faster research iteration and entirely new product capabilities.

We are still early. The playbook is still being written. A single exceptional engineer can reshape how the company operates.

Where You Come In

Our Infrastructure Engineering team is a systems engineering group with company-level responsibility. At Luma, reliability engineers work directly with the researchers and products pushing the limits of multimodal intelligence.

We operate close to the metal:

  • Kernels
  • Containers
  • Schedulers
  • Networking
  • Storage
  • GPU behavior
But we are also responsible for something bigger: turning deep systems knowledge into repeatable, scalable reliability for the entire company.

We are hiring a leader who will define that direction. You will be a technical authority, an organizational force multiplier, and a magnet for other great engineers.

What You'll Own

Reliability of the Frontier
  • Architect and operate large, heterogeneous GPU environments under extreme demand
  • Improve utilization and performance where small gains materially change company outcomes
  • Resolve failures that span hardware, OS, runtimes, and orchestration
  • Eliminate entire classes of instability
  • Build mechanisms that make heroics unnecessary
Scaling Training & Inference
  • Define how infrastructure and workloads evolve as cluster size and concurrency grow
  • Design scheduling, placement, and resource management approaches for increasingly complex jobs
  • Work directly with research to build the systems required for new model capabilities
  • Ensure inference platforms scale rapidly without sacrificing reliability or latency
  • Anticipate where today's abstractions will fail and redesign ahead of them
Building the Organization
  • Hire and develop exceptional systems and reliability engineers
  • Set the bar for technical depth, judgment, and production ownership
  • Shape architecture early through strong partnerships with research and product
  • Translate reliability constraints into long-term platform strategy
Who You Are

Required:
  • Deep expertise in Linux and distributed systems
  • Experience operating GPU / accelerator clusters in real production environments
  • Strong fluency in Kubernetes and modern open-source infrastructure
  • Comfortable debugging across hardware → kernel → runtime → orchestration
  • You understand how systems behave under contention and at scale
  • You write code and build automation
  • You think in bottlenecks, failure modes, and tradeoffs
  • Engineers trust your judgment, especially when things break
Important: This role requires comfort operating close to upstream and close to the metal. If most of your experience has been inside highly abstracted internal platforms where others owned the underlying machinery, this is unlikely to be a match.

Leadership Expectations
  • You raise reliability standards across the company
  • You influence product and research architecture early
  • You build strong partnerships, not ticket queues
  • You attract and level up exceptional engineers
  • You are curious how models use infrastructure, because improving systems expands what becomes possible
Why This Role Is Special

Most infrastructure roles optimize mature systems. This one helps define how reliability works for a new generation of AI infrastructure.

The decisions you make here will influence:
  • How research progresses
  • How products scale
  • How customers trust us
  • And how the engineering organization grows

If you want to build the reliability foundations of a company operating at the technological frontier, we should talk.

Compensation

The base pay range for this role is $230,000 - $360,000 per year.

About Luma

Luma's mission is to build unified general intelligence that can generate, understand, and operate in the physical world.

We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So we are training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.
Vacancy posted 3 days ago