Senior / Staff Network Reliability Engineer

$150k - $250k

Fluidstack

About Fluidstack

We exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built - but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it.

We're singularly focused on delivering 10 to 100s of GWs of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them - with teams spanning hardware and software. Speed and scale are our key differentiators.

Come be a part of building civilization-scale infrastructure for AI. We hire people who care deeply about this problem space. If that is you, please apply!

About the Role

Fluidstack is seeking a Network Engineer, Reliability & Observability to champion and build the processes, data collection, and reliability metrics that improve the quality and reliability of AI networks from deployment through the full lifecycle of operations. The role focuses on developing processes, systems, tools, data and data pipelines, and observability to improve network quality and to deliver automated metrics (24x7) as well as periodic reliability reports for both internal and external customers. It is ideal for experienced network operators who are passionate about reliability and have experience designing and building full-lifecycle software for work such as Quality Assurance audits, circuit audits, periodic audits, failure rates, and failure analysis.
You are passionate about hardware (electronics and optics) and software development, and you value and promote the use of data to make informed decisions in deployment, operations, and strategic sourcing. Experienced Site Reliability Engineers (SREs) with a passion for networking are encouraged to apply.

Focus

Ownership of Quality Assurance: Design, develop, and support QA processes for network hardware and networks.
Pipelines: Develop and deploy serverless, server-based, and manually triggered data pipelines that produce network quality and reliability observability for internal and external customers.
Deployment and Operations Support: Support full-lifecycle data collection and analysis, partnering with Deployment, Operations, DC hardware, and logistics teams to produce data that drives process improvements and delivers on SLAs and SLOs.
Process Engineering: Develop, pilot, and deploy process improvements for deployment and repair that produce data and consume it with machine learning to fulfill our mission.
Cross-Team Collaboration: Own without ego and execute in a collaborative team with design, deployment, and operations engineers and software developers.
Subject Matter Expert: Deep expertise in at least two subjects such as IP routing, optics, optical transport, Ethernet, RDMA/RoCE, or electrical power.

About You

Strong Operations Background: 5+ years in network engineering, including at least 3 years in operations with significant hands-on operational experience. You've run production networks or compute, responded to incidents at all hours, and debugged complex failures under pressure. You understand the difference between "working" and "production-ready".
Software Development: You have experience with ITIL, Agile (XP), and TDD, including developing and leading programs and projects. You have experience building hyperscale platforms, demonstrating fluency in Golang with supporting tools in Python or Rust.
Datacenter Fabric Expertise: Deep experience operating modern datacenter networks including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching. You're comfortable troubleshooting Layer 2/3 issues, BGP routing problems, fabric misconfigurations, and physical media failures.
Incident Response Excellence: Proven ability to lead incident response, perform systematic troubleshooting, and drive issues to resolution. You remain calm during outages, communicate clearly with stakeholders, and know when to escalate versus when to dig deeper. You've been the person others call when things break.
Matrix Leadership Experience: You understand how to build relationships with onsite teams, coordinate physical infrastructure work, and represent network engineering in a field environment. You know how to get things done in operational settings with many internal and external teams and stakeholders.
Operational Pragmatism: You balance perfection with progress. You can troubleshoot with imperfect information, make pragmatic decisions under time pressure, and prioritize based on business impact. You document as you go and continuously improve operational processes.
Self Driven: You embrace complex challenges with undefined processes and key results. You can dive in to learn, then zoom back out to build Objectives, develop Key Results, and set up a software development project and pipeline in Jira on your own. You can then switch hats and begin coding.
Travel: You are willing and able to travel up to 20% of the time to spend time with the team at our local offices or data center locations.

Nice to Haves

AI/HPC Fabric Operations: Experience operating AI/ML or HPC fabrics with RDMA (RoCEv2), lossless Ethernet (PFC, ECN), or high-performance networking. You understand the operational precision required when network performance directly impacts workload completion.
Reliability Engineering: You have experience with observability and reliability engineering from network operations or manufacturing quality.
Hardware Repair Experience: Hands-on experience coordinating hardware repairs, RMAs, and physical infrastructure work. You understand datacenter logistics, vendor escalation processes, and how to work effectively with onsite technicians.
Observability & Monitoring: Familiarity with network monitoring platforms, alerting systems, and telemetry collection. You've used monitoring tools to diagnose issues proactively and tuned alerting to reduce noise. You have experience with SQL, MySQL, and building operations dashboards.

Salary & Benefits

Competitive total compensation package (salary + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.

The base salary range for this position is $150,000 - $250,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role at the time of posting. Total compensation may also include equity in the form of stock options. We are committed to pay equity and transparency.

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

Vacancy posted 1 day ago
