RL Infra Engineer - Reliability, Observability & Scale

United States Digital Space LLC

United States Digital Space LLC in San Francisco is seeking a skilled Research Engineer to ensure the reliability and infrastructure integrity of AI training environments. This role emphasizes proactive reliability measures and critical evaluation metrics. The successful candidate will possess strong Python skills and a background in operating ML systems, coupled with the ability to enhance observability across training environments. The role involves working closely with researchers to maintain system stability as demand scales.

Vacancy posted 5 days ago

Similar jobs that could be interesting for you
Based on the RL Infra Engineer - Reliability, Observability & Scale in San Francisco, CA vacancy

Infra Reliability Engineer: Scale, Observability & Security
A leading AI research company in San Francisco is seeking a Software Engineer to enhance infrastructure supporting cutting-edge AI systems. The role involves designing reliable systems and optimizing performance for millions of users. Ideal candidates possess experience...
Suggested
OpenAI
San Francisco, CA
5 days ago
Senior RL Infra Engineer — Frontier AI Training & Scale
$350k
Mirendil is looking for engineers to build infrastructure for frontier reasoning models... .... This role focuses on large-scale reinforcement learning (RL) model training and requires a solid... ...principles. The ideal candidate will design reliable training infrastructure, implement...
Suggested
Mirendil
San Francisco, CA
1 day ago
GPU Infra Engineer: Scale Massive Clusters & Observability
...in San Francisco seeks infrastructure engineers to enhance the tooling and systems... ...include building GPU orchestration, scaling cloud batchjob systems, and designing... ...infrastructure and a strong focus on reliability and observability. This position is in-person, and international...
Suggested
Visa sponsorship
Exa
San Francisco, CA
1 day ago
Reliability Engineer: Scale Systems, Observe & Automate
A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations...
Suggested
OpenAI
San Francisco, CA
4 days ago
Staff Engineer, Inference & RL Systems — Scale Production ML
$225k
...Manufacturing Co is looking for a Software Engineer on the Inference & RL Systems team in San Francisco. The... ...performance, and ensuring high reliability for RL and post-training workflows.... ...fundamentals and experience with large-scale systems. Compensation includes a...
Suggested
Dormont Manufacturing Co
San Francisco, CA
5 days ago
Site Reliability Engineer - Scale & Observability
A dynamic tech firm located in San Francisco is seeking a Site Reliability Engineer to enhance operational health across their production systems. This high-impact role demands expertise in AWS and strong programming skills. You will manage production systems' reliability...
gamma.app
San Francisco, CA
2 days ago
Site Reliability Engineer — Scale AI Infra with Ownership
Happyrobot Inc. is looking for an Infrastructure Engineer in San Francisco, California. This role involves leading the stability and observability of systems while debugging complex issues as they arise. Candidates should have over 3 years of experience with production...
Happyrobot Inc.
San Francisco, CA
4 days ago
RL Infra Engineer: Scale GPU RL Experiments (Equity)
$300k
Aionia Group in San Francisco is seeking a Systems Infrastructure Engineer to build scalable infrastructure for RL experiments. This role offers a unique opportunity to work on innovative projects with leading researchers in a well-funded AI company. The ideal candidate...
Aionia Group
San Francisco, CA
4 days ago
Site Reliability Engineer
...actively hiring as we continue to scale. About the role We're hiring a Site Reliability Engineer (SRE) to ensure the... ...MTTR (Mean Time to Recovery) Observability & System Insight Design... ...Analytics: Amplitude ML/Platform Infra: TrueFoundry What Success...
Work at office
Remote work
Flexible hours
2 days per week
Plenful
San Francisco, CA
3 days ago
Senior Principal Cloud Infra Reliability Engineer
$261k - $326k
...specializing in AI infrastructure is seeking a Principal Engineer to enhance reliability and scalability of cloud systems. This role demands over 1... ...networking expertise and systems fundamentals, especially in high-scale environments. Competitive compensation includes a salary...
Crusoe
San Francisco, CA
4 days ago
Senior Manager, Site Reliability Engineering - Infrastructure Platform
$266k - $398k
...Director, Site Reliability Engineering – Infrastructure Platform Okta is... ...to help us to continue to scale the service with great people... ...You’ll Be Doing Lead the infra platform and shared services... .... Build a world‑class observability platform and monitoring capabilities...
Permanent employment
Flexible hours
Okta, Inc.
San Francisco, CA
5 days ago
Senior AI Cloud Infra Engineer: Scale & Reliability
Algora Public Benefit Corporation is looking for an AI Cloud Infra Engineer to join their team in San Francisco. You will ensure the reliability of backend systems and work closely with engineers to plan for future growth. The ideal candidate has strong cloud infrastructure...
Algora Public Benefit Corporation
San Francisco, CA
5 days ago
Staff Site Reliability Engineer - Observability
$147k - $202k
...Overview: We are seeking a highly technical Staff Observability Site Reliability Engineer with a specialty in Splunk to own and evolve our Splunk... ...: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors. Required Skills...
Permanent employment
Work at office
Local area
Worldwide
Flexible hours
Okta
San Francisco, CA
a month ago
Software Reliability Engineer - Scale & Resilience
Elea Ecuador seeks experienced engineers for its San Francisco HQ, focusing on scalability and reliability of systems. Candidates will design solutions for infrastructure... ...Science and proven experience in a rapidly scaling environment, with strong skills in cloud infrastructure...
Worldwide
Relocation package
Elea Ecuador
San Francisco, CA
1 day ago
Site Reliability Engineer
...looking for a world-class Site Reliability Engineer to ensure the reliability,... ...that power agentic AI at scale. Your mission: keep our... ...reliability posture end-to-end—observability, performance tuning,... ...closely with the founders, the infra team, and the dev team—and...
Blaxel
San Francisco, CA
5 days ago
Senior SRE Engineer: Scale & Reliability (Kubernetes/GCP)
...leading language learning platform is seeking an experienced SRE Engineer to ensure the reliability and resilience of their infrastructure. Responsibilities include leading incident response, improving observability, and collaborating with various teams to enhance platform...
Speak
San Francisco, CA
2 days ago
Founding Platform & Reliability Engineer
Founding Platform & Reliability Engineer About OpenArt OpenArt is an AI Storytelling... ...not slices. Ship at real scale, your work goes to millions... ...hands-on implementation, observability, and cost optimization.... ...tradeoffs to non-infra peers clearly Ability to operate...
Remote work
Worldwide
Visa sponsorship
Embedding VC
San Francisco, CA
1 day ago
Site Reliability Engineer
...looking for an Infrastructure Engineer to take the lead on scaling our operational... ...You’ll own the stability, observability, and debugging workflows... ...role where you’ll shape how reliability is done - reducing incident... ...deployment pipelines, CI/CD, or infra-as-code Experience...
Worldwide
Shift work
Happyrobot Inc.
San Francisco, CA
4 days ago
System Reliability Engineering
We’re looking for a Systems Reliability Engineer to own the reliability of our system... ...responsible for making systems observable, diagnosable, and repeatable as we scale across deployments. You’ll work... ...multiple layers of the stack (infra > services > network)...
Permanent employment
Claryo
San Francisco, CA
2 days ago
Senior Site Reliability Engineer, Platform Infrastructure (Foundations)
...developer or data scientist can scale an ML application from... ...looking for a Senior Site Reliability Engineer to join the Infrastructure... ...your laptop. As part of the Infra team, we build the scalable... ...performance, scalability, and observability of Anyscale-managed Ray...
Anyscale
San Francisco, CA
3 days ago
Infrastructure & Site Reliability Engineer
$125k - $195k
...team of exceptional, hands-on engineers to make this happen.... ...seeking an Infrastructure & Site Reliability Engineer to design, build,... .... Our philosophy towards infra is minimal, understandable,... ...compatible storage, VPNs Scale our observability platform: Build systems to...
Work at office
Visa sponsorship
Night shift
Atomicsemi
San Francisco, CA
1 day ago
Site Reliability Engineer
...’re hiring an SRE to join our engineering team at Plenful and take ownership of the reliability and performance of the systems... ...will influence how we build, scale and operate our platform as we... ...What you’ll do Reliability, Observability and Performance: Maintain and...
Work at office
Remote work
Flexible hours
2 days per week
Plenful
San Francisco, CA
2 days ago
Staff Security Reliability Engineer
About the Team The Infrastructure Engineering function sits within IT and is responsible for reliably building, deploying, and... ...operational leverage as OpenAI scales. About the Role We are looking... ...Azure management patterns. Build observability, alerting, and incident...
Work at office
The Consulting Solutions
San Francisco, CA
4 days ago
Director, Site Reliability Engineering
$205k - $305k
...Director Of Site Reliability Engineering Interested in working on cutting-edge blockchain technology... ...source platform that operates at high-scale today. Developers and companies around... ...SDF engineering teams build, deploy, observe, and operate software with confidence....
Temporary work
Work at office
Local area
Worldwide
Flexible hours
Stellar
San Francisco, CA
2 days ago
Infra Tech Lead: Scale & Postgres Reliability (Equity)
...experienced Infrastructure Tech Lead to oversee the scaling of its platform. You will enhance infrastructure reliability and performance as customer demand grows,... ...operational maturity and infrastructure-heavy engineering. Offerings include competitive salary, equity,...
Flexible hours
LIGHTFIELD INC
San Francisco, CA
3 days ago
Staff DevOps Engineer - Scale Automation & Reliability
A leading streaming platform is looking for a Staff DevOps Engineer in San Francisco, CA, to automate and scale systems supporting their streaming services. The role involves leading projects on infrastructure automation, best-practice adoption, and collaboration across...
Crunchyroll
San Francisco, CA
3 days ago
Staff Network Reliability Engineer - Scale & Incident Response
$195k - $235k
Crusoe Energy Systems LLC is looking for a Staff Network Operations Engineer to ensure production reliability across its global network infrastructure. This role is critical in maintaining uptime and facilitating AI workloads via incident response and operational excellence...
Crusoe Energy Systems LLC
San Francisco, CA
5 days ago
Security Reliability Engineer: Build Safe, Scalable Infra
Slope is seeking an experienced Security Reliability Engineer to design and operate secure, scalable infrastructure in San Francisco. This role involves owning critical infrastructure systems from architecture through operation, requiring a hands-on engineer who excels...
Slope
San Francisco, CA
2 days ago
Member of Technical Staff, Site Reliability Engineer
$10k
...error budgets, and builds the reliability culture from scratch. This... ...postmortem discipline at scale on a real oncall rotation.... ..., not just scripts), Bash. Observability: Chronosphere, Prometheus, Grafana... ...or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X,...
Flexible hours
Slope
San Francisco, CA
3 days ago
Founding Engineer: Scale AI Infra & Orchestration Equity
$300k
Albert Bow is seeking a Founding Engineer to design and scale their distributed systems for autonomous AI agents. With a salary of up to $300,0... ...architecting core infrastructure and ensuring production reliability while requiring 5+ years in early-stage product development...
Albert Bow
San Francisco, CA
4 days ago
