RL Infra Engineer - Reliability, Observability & Scale
United States Digital Space LLC
United States Digital Space LLC in San Francisco is seeking a skilled Research Engineer to ensure the reliability and infrastructure integrity of AI training environments. This role emphasizes proactive reliability measures and critical evaluation metrics. The successful candidate will possess strong Python skills and a background in operating ML systems, coupled with the ability to enhance observability across training environments. The role involves working closely with researchers to maintain system stability as demand scales.
- A leading AI research company in San Francisco is seeking a Software Engineer to enhance infrastructure supporting cutting-edge AI systems. The role involves designing reliable systems and optimizing performance for millions of users. Ideal candidates possess experience...
$350k
Mirendil is looking for engineers to build infrastructure for frontier reasoning models... .... This role focuses on large-scale reinforcement learning (RL) model training and requires a solid... ...principles. The ideal candidate will design reliable training infrastructure, implement...
- ...in San Francisco seeks infrastructure engineers to enhance the tooling and systems... ...include building GPU orchestration, scaling cloud batch job systems, and designing... ...infrastructure and a strong focus on reliability and observability. This position is in-person, and international... (Visa sponsorship)
- A leading AI research company based in San Francisco is seeking experienced reliability engineers to scale their infrastructure and ensure system performance and reliability. This role involves collaborating with diverse teams to develop resilient systems and enhance operations...
$225k
...Manufacturing Co is looking for a Software Engineer on the Inference & RL Systems team in San Francisco. The... ...performance, and ensuring high reliability for RL and post-training workflows.... ...fundamentals and experience with large-scale systems. Compensation includes a...
- A dynamic tech firm located in San Francisco is seeking a Site Reliability Engineer to enhance operational health across their production systems. This high-impact role demands expertise in AWS and strong programming skills. You will manage production systems' reliability...
- Happyrobot Inc. is looking for an Infrastructure Engineer in San Francisco, California. This role involves leading the stability and observability of systems while debugging complex issues as they arise. Candidates should have over 3 years of experience with production...
$300k
Aionia Group in San Francisco is seeking a Systems Infrastructure Engineer to build scalable infrastructure for RL experiments. This role offers a unique opportunity to work on innovative projects with leading researchers in a well-funded AI company. The ideal candidate...
- ...actively hiring as we continue to scale. About the role We're hiring a Site Reliability Engineer (SRE) to ensure the... ...MTTR (Mean Time to Recovery) Observability & System Insight Design... ...Analytics: Amplitude ML/Platform Infra: TrueFoundry What Success... (Work at office, Remote work, Flexible hours, 2 days per week)
$261k - $326k
...specializing in AI infrastructure is seeking a Principal Engineer to enhance reliability and scalability of cloud systems. This role demands over 1... ...networking expertise and systems fundamentals, especially in high-scale environments. Competitive compensation includes a salary...
$266k - $398k
...Director, Site Reliability Engineering – Infrastructure Platform Okta is... ...to help us to continue to scale the service with great people... ...You’ll Be Doing Lead the infra platform and shared services... .... Build a world‑class observability platform and monitoring capabilities... (Permanent employment, Flexible hours)
- Algora Public Benefit Corporation is looking for an AI Cloud Infra Engineer to join their team in San Francisco. You will ensure the reliability of backend systems and work closely with engineers to plan for future growth. The ideal candidate has strong cloud infrastructure...
$147k - $202k
...Overview: We are seeking a highly technical Staff Observability Site Reliability Engineer with a specialty in Splunk to own and evolve our Splunk... ...: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors. Required Skills... (Permanent employment, Work at office, Local area, Worldwide, Flexible hours)
- Elea Ecuador seeks experienced engineers for its San Francisco HQ, focusing on scalability and reliability of systems. Candidates will design solutions for infrastructure... ...Science and proven experience in a rapidly scaling environment, with strong skills in cloud infrastructure... (Worldwide, Relocation package)
- ...looking for a world-class Site Reliability Engineer to ensure the reliability,... ...that power agentic AI at scale. Your mission: keep our... ...reliability posture end-to-end—observability, performance tuning,... ...closely with the founders, the infra team, and the dev team—and...
- ...leading language learning platform is seeking an experienced SRE Engineer to ensure the reliability and resilience of their infrastructure. Responsibilities include leading incident response, improving observability, and collaborating with various teams to enhance platform...
- Founding Platform & Reliability Engineer About OpenArt OpenArt is an AI Storytelling... ...not slices. Ship at real scale, your work goes to millions... ...hands-on implementation, observability, and cost optimization.... ...tradeoffs to non-infra peers clearly Ability to operate... (Remote work, Worldwide, Visa sponsorship)
- ...looking for an Infrastructure Engineer to take the lead on scaling our operational... ...You’ll own the stability, observability, and debugging workflows... ...role where you’ll shape how reliability is done - reducing incident... ...deployment pipelines, CI/CD, or infra-as-code Experience... (Worldwide, Shift work)
- We’re looking for a Systems Reliability Engineer to own the reliability of our system... ...responsible for making systems observable, diagnosable, and repeatable as we scale across deployments. You’ll work... ...multiple layers of the stack (infra > services > network)... (Permanent employment)
- ...developer or data scientist can scale an ML application from... ...looking for a Senior Site Reliability Engineer to join the Infrastructure... ...your laptop. As part of the Infra team, we build the scalable... ...performance, scalability, and observability of Anyscale-managed Ray...
$125k - $195k
...team of exceptional, hands-on engineers to make this happen.... ...seeking an Infrastructure & Site Reliability Engineer to design, build,... .... Our philosophy towards infra is minimal, understandable,... ...compatible storage, VPNs Scale our observability platform: Build systems to... (Work at office, Visa sponsorship, Night shift)
- ...’re hiring an SRE to join our engineering team at Plenful and take ownership of the reliability and performance of the systems... ...will influence how we build, scale and operate our platform as we... ...What you’ll do Reliability, Observability and Performance: Maintain and... (Work at office, Remote work, Flexible hours, 2 days per week)
- About the Team The Infrastructure Engineering function sits within IT and is responsible for reliably building, deploying, and... ...operational leverage as OpenAI scales. About the Role We are looking... ...Azure management patterns. Build observability, alerting, and incident... (Work at office)
$205k - $305k
...Director Of Site Reliability Engineering Interested in working on cutting-edge blockchain technology... ...source platform that operates at high-scale today. Developers and companies around... ...SDF engineering teams build, deploy, observe, and operate software with confidence.... (Temporary work, Work at office, Local area, Worldwide, Flexible hours)
- ...experienced Infrastructure Tech Lead to oversee the scaling of its platform. You will enhance infrastructure reliability and performance as customer demand grows,... ...operational maturity and infrastructure-heavy engineering. Offerings include competitive salary, equity,... (Flexible hours)
- A leading streaming platform is looking for a Staff DevOps Engineer in San Francisco, CA, to automate and scale systems supporting their streaming services. The role involves leading projects on infrastructure automation, best-practice adoption, and collaboration across...
$195k - $235k
Crusoe Energy Systems LLC is looking for a Staff Network Operations Engineer to ensure production reliability across its global network infrastructure. This role is critical in maintaining uptime and facilitating AI workloads via incident response and operational excellence...
- Slope is seeking an experienced Security Reliability Engineer to design and operate secure, scalable infrastructure in San Francisco. This role involves owning critical infrastructure systems from architecture through operation, requiring a hands-on engineer who excels...
$10k
...error budgets, and builds the reliability culture from scratch. This... ...postmortem discipline at scale on a real oncall rotation.... ..., not just scripts), Bash. Observability: Chronosphere, Prometheus, Grafana... ...or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X,... (Flexible hours)
$300k
Albert Bow is seeking a Founding Engineer to design and scale their distributed systems for autonomous AI agents. With a salary of up to $300,0... ...architecting core infrastructure and ensuring production reliability while requiring 5+ years in early-stage product development...


