Senior Site Reliability Engineer
$170k - $290k
Luma AI
About Luma AI
Luma's mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In
We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure. You will build, maintain, and scale Luma's infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You'll Do
- Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
- Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
- Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
- Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
- Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
- Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.
- 5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
- Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
- Expert in Technologies: You have working experience with Terraform, Airflow, and Ray.
- Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
- Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
- Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
- Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
- Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.
- Deep expertise with GPU tooling for NVIDIA and AMD GPUs like DCGM or ROCm.
- Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
- Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
- Deep expertise in data pipelines and infrastructure.
Vacancy posted 5 days ago

