Software Engineer - Data Infra Reliability

$170k - $360k

Luma AI

As our models scale to "omni" capabilities, our data infrastructure must be unbreakable. We are looking for a Data Reliability Engineer who brings a Site Reliability Engineering (SRE) mindset to the world of massive-scale data. You will be responsible for the resilience, automation, and scalability of the petabyte-scale pipelines that feed our research. This is not just about keeping the lights on; it's about treating infrastructure as code and building self-healing data systems that allow our researchers to train on massive datasets without interruption. Whether you are a junior engineer with a passion for automation or a seasoned SRE veteran, you will play a critical role in hardening the backbone of Luma's intelligence.

Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure.

Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs.

Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads.

Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms.

Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems.

Deep SRE/DevOps Proficiency: You live and breathe Linux, networking, and automation.

Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP).

Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers.

Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management.

Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store).

Experience managing GPU clusters or AI/ML workloads.

Background in both Software Engineering and Operations (DevOps).

Experience with high-performance networking (InfiniBand/RDMA).

The base pay range for this role is $170,000 – $360,000 per year.

Luma's mission is to build unified general intelligence that can generate, understand, and operate in the physical world. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So we are training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

Vacancy posted 1 day ago
