Lead Infrastructure and Reliability Engineer (Systems & Scale)

$230k - $360k

Luma AI

About Luma AI

A new class of intelligence is emerging: systems that understand and generate the world across video, images, audio, and language.

Building multimodal AGI is not just a modeling challenge. It is an infrastructure challenge at the edge of what hardware, software, and organizations can support.

At Luma, we operate rapidly scaling 10k+ GPU fleets, pushing utilization, throughput, and reliability hard enough that yesterday's solutions break regularly. Researchers depend on this infrastructure to move the frontier forward. Customers depend on it to power real creative work.

Many companies run accelerators. Very few sit directly next to the teams inventing the models that redefine what those accelerators must do.

At Luma, improvements to scheduling, efficiency, and reliability immediately translate into faster research iteration and entirely new product capabilities.

We are still early. The playbook is still being written. A single exceptional engineer can reshape how the company operates.

Where You Come In

Our Infrastructure Engineering team is a systems engineering group with company-level responsibility. At Luma, reliability engineers work directly with the researchers and products pushing the limits of multimodal intelligence.

We operate close to the metal:

Kernels
Containers
Schedulers
Networking
Storage
GPU behavior

But we are also responsible for something bigger:

Turning deep systems knowledge into repeatable, scalable reliability for the entire company. We are hiring a leader who will define that direction. You will be a technical authority, an organizational force multiplier, and a magnet for other great engineers.

What You'll Own

Reliability of the Frontier

Architect and operate large, heterogeneous GPU environments under extreme demand
Improve utilization and performance where small gains materially change company outcomes
Resolve failures that span hardware, OS, runtimes, and orchestration
Eliminate entire classes of instability
Build mechanisms that make heroics unnecessary

Scaling Training & Inference

Define how infrastructure and workloads evolve as cluster size and concurrency grow
Design scheduling, placement, and resource management approaches for increasingly complex jobs
Work directly with research to build the systems required for new model capabilities
Ensure inference platforms scale rapidly without sacrificing reliability or latency
Anticipate where today's abstractions will fail and redesign ahead of them

Building the Organization

Hire and develop exceptional systems and reliability engineers
Set the bar for technical depth, judgment, and production ownership
Shape architecture early through strong partnerships with research and product
Translate reliability constraints into long-term platform strategy

Who You Are

Required:

Deep expertise in Linux and distributed systems
Experience operating GPU / accelerator clusters in real production environments
Strong fluency in Kubernetes and modern open-source infrastructure
Comfortable debugging across hardware → kernel → runtime → orchestration
You understand how systems behave under contention and at scale
You write code and build automation
You think in bottlenecks, failure modes, and tradeoffs
Engineers trust your judgment, especially when things break

Important: This role requires comfort operating close to upstream and close to the metal. If most of your experience has been inside highly abstracted internal platforms where others owned the underlying machinery, this is unlikely to be a match.

Leadership Expectations

You raise reliability standards across the company
You influence product and research architecture early
You build strong partnerships, not ticket queues
You attract and level up exceptional engineers
You are curious how models use infrastructure, because improving systems expands what becomes possible

Why This Role Is Special

Most infrastructure roles optimize mature systems. This one helps define how reliability works for a new generation of AI infrastructure.

The decisions you make here will influence:

How research progresses
How products scale
How customers trust us
And how the engineering organization grows

If you want to build the reliability foundations of a company operating at the technological frontier, we should talk.

Compensation

The base pay range for this role is $230,000 - $360,000 per year.

About Luma

Luma's mission is to build unified general intelligence that can generate, understand, and operate in the physical world.

We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So we are training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

Vacancy posted 3 days ago
