Network Engineer, Reliability & Observability

$150k - $250k

Fluidstack

About Fluidstack

We exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built - but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it.
We're singularly focused on delivering 10 to 100s of GWs of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them - with teams spanning hardware and software. Speed and scale are our key differentiators. Come be a part of building civilization-scale infrastructure for AI.
We hire people who care deeply about this problem space. If that is you, please apply!

About the Role

Fluidstack is seeking a Network Engineer, Reliability & Observability to champion and build the processes, data collection, and reliability metrics that improve the quality and reliability of AI networks, from deployment through the full lifecycle of operations.

This role is focused on developing processes, systems, tools, data and data pipelines, and observability to improve the quality of networks and deliver automated metrics (24x7) as well as periodic reliability reports for both internal and external customers.

This role is ideal for experienced network operators who are passionate about reliability and have experience designing and building full-lifecycle software for work such as Quality Assurance audits, circuit audits, periodic audits, failure rates, and failure analysis. You are passionate about hardware (electronics and optics) and software development, and you value and promote the use of data to make informed decisions in deployment, operations, and strategic sourcing.

Experienced Site Reliability Engineers (SREs) with a passion for networking are encouraged to apply.

Focus

Ownership of Quality Assurance: Design, develop, and support QA processes for network hardware and networks.
Pipelines: Develop and deploy serverless, server-based, and manually triggered data pipelines that produce network quality and reliability observability for internal and external customers.
Deployment and Operations Support: Support full-lifecycle data collection and analysis, partnering with Deployment, Operations, DC hardware, and logistics teams to produce data that drives process improvements and delivers on SLAs and SLOs.
Process Engineering: Develop, pilot, and deploy process improvements for deployment and repair that produce data and consume it with machine learning to fulfill our mission.
Cross-Team Collaboration: Own without ego and execute in a collaborative team with design, deployment, and operations engineers and software developers.
Subject Matter Expert: Deep expertise in at least two subjects such as IP routing, optics, optical transport, Ethernet, RDMA/RoCE, or electrical power.

About You

Strong Operations Background: 5+ years in network engineering and 3+ years in operations, with significant hands-on operational experience. You've run production networks or compute, responded to incidents at all hours, and debugged complex failures under pressure. You understand the difference between "working" and "production-ready".
Software Development: You have experience with ITIL, Agile (XP), and TDD, including developing and leading programs and projects. You have experience building hyperscale platforms, demonstrating fluency in Golang with supporting tools in Python or Rust.
Datacenter Fabric Expertise: Deep experience operating modern datacenter networks, including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching. You're comfortable troubleshooting Layer 2/3 issues, BGP routing problems, fabric misconfigurations, and physical media failures.
Incident Response Excellence: Proven ability to lead incident response, perform systematic troubleshooting, and drive issues to resolution. You remain calm during outages, communicate clearly with stakeholders, and know when to escalate versus when to dig deeper. You've been the person others call when things break.
Matrix Leadership Experience: You understand how to build relationships with onsite teams, coordinate physical infrastructure work, and represent network engineering in a field environment. You know how to get things done in operational settings with many internal and external teams and stakeholders.
Operational Pragmatism: You balance perfection with progress. You can troubleshoot with imperfect information, make pragmatic decisions under time pressure, and prioritize based on business impact. You document as you go and continuously improve operational processes.
Self-Driven: You embrace complex challenges with undefined processes and key results. You can dive in to learn, then zoom back out to build Objectives, develop Key Results, and build a software development project and pipeline in Jira solo. You can then switch hats and begin coding.
Travel: You are willing and able to travel to spend time with the team at our local offices or data center locations, up to 20% of the time.

Nice to Haves

AI/HPC Fabric Operations: Experience operating AI/ML or HPC fabrics with RDMA (RoCEv2), lossless Ethernet (PFC, ECN), or high-performance networking. You understand the operational precision required when network performance directly impacts workload completion.
Reliability Engineering: You have experience with observability and reliability engineering from network operations or manufacturing quality.
Hardware Repair Experience: Hands-on experience coordinating hardware repairs, RMAs, and physical infrastructure work. You understand datacenter logistics, vendor escalation processes, and how to work effectively with onsite technicians.
Observability & Monitoring: Familiarity with network monitoring platforms, alerting systems, and telemetry collection. You've used monitoring tools to diagnose issues proactively and tune alerting to reduce noise. You have experience with SQL, MySQL, and building operations dashboards.

Salary & Benefits

Competitive total compensation package (salary + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.

The base salary range for this position is $150,000 - $250,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role at the time of posting. Total compensation may also include equity in the form of stock options.

We are committed to pay equity and transparency.

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

You will receive a confirmation email once your application has been successfully accepted. If there is an error with your submission and you did not receive a confirmation email, please email the address shown on click.appcast.io with your resume/CV, the role you've applied for, and the date you submitted your application. Someone from our recruiting team will be in touch.


Vacancy posted 5 days ago
