Software Engineer - Data Infra Reliability

$170k - $360k

Luma AI

As our models scale to "omni" capabilities, our data infrastructure must be unbreakable. We are looking for a Data Reliability Engineer who brings a Site Reliability Engineering (SRE) mindset to the world of massive-scale data. You will be responsible for the resilience, automation, and scalability of the petabyte-scale pipelines that feed our research. This is not just about keeping the lights on; it's about treating infrastructure as code and building self-healing data systems that allow our researchers to train on massive datasets without interruption. Whether you are a junior engineer with a passion for automation or a seasoned SRE veteran, you will play a critical role in hardening the backbone of Luma's intelligence.

What You'll Do
  • Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure.
  • Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs.
  • Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads.
  • Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms.
  • Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems.
Who You Are
  • Deep SRE/DevOps proficiency: You live and breathe Linux, networking, and automation.
  • Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP).
  • Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers.
  • Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management.
  • Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store).
What Sets You Apart (Bonus Points)
  • Experience managing GPU clusters or AI/ML workloads.
  • Background in both Software Engineering and Operations (DevOps).
  • Experience with high-performance networking (InfiniBand/RDMA).
Compensation

The base pay range for this role is $170,000 – $360,000 per year.

Luma's mission is to build unified general intelligence that can generate, understand, and operate in the physical world. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

Vacancy posted 3 days ago