Lead Infrastructure and Reliability Engineer (Systems & Scale)
$230k - $360k
Luma AI
About Luma AI

A new class of intelligence is emerging, systems that understand and generate the world across video, images, audio, and language. Building multimodal AGI is not just a modeling challenge. It is an infrastructure challenge at the edge of what hardware, software, and organizations can support.

At Luma, we operate rapidly scaling 10k+ GPU fleets, pushing utilization, throughput, and reliability hard enough that yesterday's solutions break regularly. Researchers depend on this infrastructure to move the frontier forward. Customers depend on it to power real creative work.

Many companies run accelerators. Very few sit directly next to the teams inventing the models that redefine what those accelerators must do. At Luma, improvements to scheduling, efficiency, and reliability immediately translate into faster research iteration and entirely new product capabilities.

We are still early. The playbook is still being written. A single exceptional engineer can reshape how the company operates.

Where You Come In

Our Infrastructure Engineering team is a systems engineering group with company-level responsibility. At Luma, reliability engineers work directly with the researchers and products pushing the limits of multimodal intelligence.

We operate close to the metal:
- Kernels
- Containers
- Schedulers
- Networking
- Storage
- GPU behavior
- Architect and operate large, heterogeneous GPU environments under extreme demand
- Improve utilization and performance where small gains materially change company outcomes
- Resolve failures that span hardware, OS, runtimes, and orchestration
- Eliminate entire classes of instability
- Build mechanisms that make heroics unnecessary
- Define how infrastructure and workloads evolve as cluster size and concurrency grow
- Design scheduling, placement, and resource management approaches for increasingly complex jobs
- Work directly with research to build the systems required for new model capabilities
- Ensure inference platforms scale rapidly without sacrificing reliability or latency
- Anticipate where today's abstractions will fail and redesign ahead of them
- Hire and develop exceptional systems and reliability engineers
- Set the bar for technical depth, judgment, and production ownership
- Shape architecture early through strong partnerships with research and product
- Translate reliability constraints into long-term platform strategy
- Deep expertise in Linux and distributed systems
- Experience operating GPU / accelerator clusters in real production environments
- Strong fluency in Kubernetes and modern open-source infrastructure
- Comfortable debugging across hardware → kernel → runtime → orchestration
- You understand how systems behave under contention and at scale
- You write code and build automation
- You think in bottlenecks, failure modes, and tradeoffs
- Engineers trust your judgment, especially when things break
- You raise reliability standards across the company
- You influence product and research architecture early
- You build strong partnerships, not ticket queues
- You attract and level up exceptional engineers
- You are curious how models use infrastructure, because improving systems expands what becomes possible
- How research progresses
- How products scale
- How customers trust us
- And how the engineering organization grows
Vacancy posted 3 days ago


