RL Infra Engineer - Reliability, Observability & Scale

United States Digital Space LLC

United States Digital Space LLC in San Francisco is seeking a skilled Research Engineer to ensure the reliability and infrastructure integrity of AI training environments. This role emphasizes proactive reliability measures and critical evaluation metrics. The successful candidate will possess strong Python skills and a background in operating ML systems, coupled with the ability to enhance observability across training environments. The role involves working closely with researchers to maintain system stability as demand scales.

Vacancy posted 5 days ago
Similar jobs that could be interesting for you
Based on the RL Infra Engineer - Reliability, Observability & Scale in San Francisco, CA vacancy
  • A leading AI research company in San Francisco is seeking a Software Engineer to enhance infrastructure supporting cutting-edge AI systems. The role involves designing reliable systems and optimizing performance for millions of users. Ideal candidates possess experience... 
    Suggested

    OpenAI

    San Francisco, CA
    5 days ago
  • $350k

    Mirendil is looking for engineers to build infrastructure for frontier reasoning models...  .... This role focuses on large-scale reinforcement learning (RL) model training and requires a solid...  ...principles. The ideal candidate will design reliable training infrastructure, implement... 
    Suggested

    Mirendil

    San Francisco, CA
    1 day ago
  •  ...in San Francisco seeks infrastructure engineers to enhance the tooling and systems...  ...include building GPU orchestration, scaling cloud batchjob systems, and designing...  ...infrastructure and a strong focus on reliability and observability. This position is in-person, and international... 
    Suggested
    Visa sponsorship

    Exa

    San Francisco, CA
    1 day ago
  • A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations... 
    Suggested

    OpenAI

    San Francisco, CA
    4 days ago
  • $225k

     ...Manufacturing Co is looking for a Software Engineer on the Inference & RL Systems team in San Francisco. The...  ...performance, and ensuring high reliability for RL and post-training workflows....  ...fundamentals and experience with large-scale systems. Compensation includes a... 
    Suggested

    Dormont Manufacturing Co

    San Francisco, CA
    5 days ago
  • A dynamic tech firm located in San Francisco is seeking a Site Reliability Engineer to enhance operational health across their production systems. This high-impact role demands expertise in AWS and strong programming skills. You will manage production systems' reliability... 

    gamma.app

    San Francisco, CA
    2 days ago
  • Happyrobot Inc. is looking for an Infrastructure Engineer in San Francisco, California. This role involves leading the stability and observability of systems while debugging complex issues as they arise. Candidates should have over 3 years of experience with production... 

    Happyrobot Inc.

    San Francisco, CA
    4 days ago
  • $300k

    Aionia Group in San Francisco is seeking a Systems Infrastructure Engineer to build scalable infrastructure for RL experiments. This role offers a unique opportunity to work on innovative projects with leading researchers in a well-funded AI company. The ideal candidate... 

    Aionia Group

    San Francisco, CA
    4 days ago
  •  ...actively hiring as we continue to scale. About the role We're hiring a Site Reliability Engineer (SRE) to ensure the...  ...MTTR (Mean Time to Recovery) Observability & System Insight Design...  ...Analytics: Amplitude ML/Platform Infra: TrueFoundry What Success... 
    Work at office
    Remote work
    Flexible hours
    2 days per week

    Plenful

    San Francisco, CA
    3 days ago
  • $261k - $326k

     ...specializing in AI infrastructure is seeking a Principal Engineer to enhance reliability and scalability of cloud systems. This role demands over 1...  ...networking expertise and systems fundamentals, especially in high-scale environments. Competitive compensation includes a salary... 

    Crusoe

    San Francisco, CA
    4 days ago
  • $266k - $398k

     ...Director, Site Reliability Engineering – Infrastructure Platform Okta is...  ...to help us to continue to scale the service with great people...  ...You’ll Be Doing Lead the infra platform and shared services...  .... Build a world‑class observability platform and monitoring capabilities... 
    Permanent employment
    Flexible hours

    Okta, Inc.

    San Francisco, CA
    5 days ago
  • Algora Public Benefit Corporation is looking for an AI Cloud Infra Engineer to join their team in San Francisco. You will ensure the reliability of backend systems and work closely with engineers to plan for future growth. The ideal candidate has strong cloud infrastructure... 

    Algora Public Benefit Corporation

    San Francisco, CA
    5 days ago
  • $147k - $202k

     ...Overview: We are seeking a highly technical Staff Observability Site Reliability Engineer with a specialty in Splunk to own and evolve our Splunk...  ...: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors. Required Skills... 
    Permanent employment
    Work at office
    Local area
    Worldwide
    Flexible hours

    Okta

    San Francisco, CA
    a month ago
  • Elea Ecuador seeks experienced engineers for its San Francisco HQ, focusing on scalability and reliability of systems. Candidates will design solutions for infrastructure...  ...Science and proven experience in a rapidly scaling environment, with strong skills in cloud infrastructure... 
    Worldwide
    Relocation package

    Elea Ecuador

    San Francisco, CA
    1 day ago
  •  ...looking for a world-class Site Reliability Engineer to ensure the reliability,...  ...that power agentic AI at scale. Your mission: keep our...  ...reliability posture end-to-end—observability, performance tuning,...  ...closely with the founders, the infra team, and the dev team—and... 

    Blaxel

    San Francisco, CA
    5 days ago
  •  ...leading language learning platform is seeking an experienced SRE Engineer to ensure the reliability and resilience of their infrastructure. Responsibilities include leading incident response, improving observability, and collaborating with various teams to enhance platform... 

    Speak

    San Francisco, CA
    2 days ago
  • Founding Platform & Reliability Engineer About OpenArt OpenArt is an AI Storytelling...  ...not slices. Ship at real scale, your work goes to millions...  ...hands-on implementation, observability, and cost optimization....  ...tradeoffs to non-infra peers clearly Ability to operate... 
    Remote work
    Worldwide
    Visa sponsorship

    Embedding VC

    San Francisco, CA
    1 day ago
  •  ...looking for an Infrastructure Engineer to take the lead on scaling our operational...  ...You’ll own the stability, observability, and debugging workflows...  ...role where you’ll shape how reliability is done - reducing incident...  ...deployment pipelines, CI/CD, or infra-as-code Experience... 
    Worldwide
    Shift work

    Happyrobot Inc.

    San Francisco, CA
    4 days ago
  • We’re looking for a Systems Reliability Engineer to own the reliability of our system...  ...responsible for making systems observable, diagnosable, and repeatable as we scale across deployments. You’ll work...  ...multiple layers of the stack (infra > services > network)... 
    Permanent employment

    Claryo

    San Francisco, CA
    2 days ago
  •  ...developer or data scientist can scale an ML application from...  ...looking for a Senior Site Reliability Engineer to join the Infrastructure...  ...your laptop. As part of the Infra team, we build the scalable...  ...performance, scalability, and observability of Anyscale-managed Ray... 

    Anyscale

    San Francisco, CA
    3 days ago
  • $125k - $195k

     ...team of exceptional, hands-on engineers to make this happen....  ...seeking an Infrastructure & Site Reliability Engineer to design, build,...  .... Our philosophy towards infra is minimal, understandable,...  ...compatible storage, VPNs Scale our observability platform: Build systems to... 
    Work at office
    Visa sponsorship
    Night shift

    Atomicsemi

    San Francisco, CA
    1 day ago
  •  ...’re hiring an SRE to join our engineering team at Plenful and take ownership of the reliability and performance of the systems...  ...will influence how we build, scale and operate our platform as we...  ...What you’ll do Reliability, Observability and Performance: Maintain and... 
    Work at office
    Remote work
    Flexible hours
    2 days per week

    Plenful

    San Francisco, CA
    2 days ago
  • About the Team The Infrastructure Engineering function sits within IT and is responsible for reliably building, deploying, and...  ...operational leverage as OpenAI scales. About the Role We are looking...  ...Azure management patterns. Build observability, alerting, and incident... 
    Work at office

    The Consulting Solutions

    San Francisco, CA
    4 days ago
  • $205k - $305k

     ...Director Of Site Reliability Engineering Interested in working on cutting-edge blockchain technology...  ...source platform that operates at high-scale today. Developers and companies around...  ...SDF engineering teams build, deploy, observe, and operate software with confidence.... 
    Temporary work
    Work at office
    Local area
    Worldwide
    Flexible hours

    Stellar

    San Francisco, CA
    2 days ago
  •  ...experienced Infrastructure Tech Lead to oversee the scaling of its platform. You will enhance infrastructure reliability and performance as customer demand grows,...  ...operational maturity and infrastructure-heavy engineering. Offerings include competitive salary, equity,... 
    Flexible hours

    LIGHTFIELD INC

    San Francisco, CA
    3 days ago
  • A leading streaming platform is looking for a Staff DevOps Engineer in San Francisco, CA, to automate and scale systems supporting their streaming services. The role involves leading projects on infrastructure automation, best-practice adoption, and collaboration across... 

    Crunchyroll

    San Francisco, CA
    3 days ago
  • $195k - $235k

    Crusoe Energy Systems LLC is looking for a Staff Network Operations Engineer to ensure production reliability across its global network infrastructure. This role is critical in maintaining uptime and facilitating AI workloads via incident response and operational excellence... 

    Crusoe Energy Systems LLC

    San Francisco, CA
    5 days ago
  • Slope is seeking an experienced Security Reliability Engineer to design and operate secure, scalable infrastructure in San Francisco. This role involves owning critical infrastructure systems from architecture through operation, requiring a hands-on engineer who excels... 

    Slope

    San Francisco, CA
    2 days ago
  • $10k

     ...error budgets, and builds the reliability culture from scratch. This...  ...postmortem discipline at scale on a real oncall rotation....  ..., not just scripts), Bash. Observability: Chronosphere, Prometheus, Grafana...  ...or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X,... 
    Flexible hours

    Slope

    San Francisco, CA
    3 days ago
  • $300k

    Albert Bow is seeking a Founding Engineer to design and scale their distributed systems for autonomous AI agents. With a salary of up to $300,0...  ...architecting core infrastructure and ensuring production reliability while requiring 5+ years in early-stage product development... 

    Albert Bow

    San Francisco, CA
    4 days ago
