Senior Site Reliability Engineer

$170k - $290k

Luma AI

About Luma AI

Luma's mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure.

You will build, maintain, and scale Luma's infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You'll Do

  • Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
  • Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
  • Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are

  • 5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
  • Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
  • Expert in Technologies: You have working experience with Terraform, Airflow, and Ray.
  • Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
  • Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
  • Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
  • Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
  • Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)

  • Deep expertise with GPU tooling for NVIDIA and AMD hardware, such as DCGM or ROCm.
  • Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
  • Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
  • Deep expertise in data pipelines and infrastructure.

Compensation

The base pay range for this role is $170,000 - $290,000 per year.

About Luma

Luma's mission is to build unified general intelligence that can generate, understand, and operate in the physical world.

We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable and useful systems, the next step function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.
Vacancy posted 5 days ago