Senior Network & Site Reliability Engineer

Alembic, Inc.

About Us

Alembic is the pioneering Causal AI platform. We help the world's largest enterprises move past correlation to prove what actually drives business outcomes: the question marketing and growth teams have never been able to answer with confidence. Fortune 100 companies including Nvidia, Delta Air Lines, and Mars use Alembic to make multimillion-dollar decisions on trusted, causal evidence. We're backed by a $145M Series B from WndrCo (founded by Jeffrey Katzenberg), Jensen Huang, Joe Montana, Prysm Capital, and Accenture. Our models run on our own NVIDIA DGX SuperPOD built on Grace Blackwell infrastructure, one of the fastest private supercomputers in the world. (We've melted GPUs getting here.)

About the Role

We're building infrastructure that has to perform under real-world scale, reliability, and security demands, and we're looking for an engineer who wants to own the foundation it runs on.

This isn't a traditional "keep the lights on" role. You will design and operate the global network and reliability layer behind one of the world's fastest private supercomputers: the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.

It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.

What You'll Do

Architect and operate scalable, secure network architecture for high-security requirements and large-scale machine learning workloads.
Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.
Build and maintain comprehensive monitoring, alerting, and incident response (SLOs, runbooks, and on-call rotations), and drive post-incident analysis and continuous improvement.
Ensure security, compliance, and operational readiness across our network and cloud infrastructure.
Partner across engineering and data science to drive a culture of performance and reliability.

What Will Help You Succeed

8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
A strong background in network security, architecture, design, and operations.
Extensive hands-on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols: BGP, QoS, MPLS, and IPsec VPNs.
Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).
Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).
WAN engineering: carrier circuit provisioning and external network peering.
Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.
Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross-functional communication.

Also Helpful

NVIDIA networking technologies: Cumulus Linux, InfiniBand, Spectrum-X, and BlueField DPUs (this is the fabric behind our SuperPOD).
Familiarity with data-intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).
Security practices for applications and infrastructure, and experience in high-compliance or SOC 2 environments.

The Role Is Right for You If

You want to own mission-critical network and infrastructure end to end, from architecture to incident management, not just keep it running.
You'd rather build and automate than direct from a distance, and you want meaningful influence over how a high-performance platform scales.

Why You Might Be Excited About Alembic

Hard problems with real impact: You'll own the network and reliability layer behind systems that influence multimillion-dollar decisions at Fortune 100 companies.
Cutting-edge technology: Operate our own NVIDIA DGX SuperPOD on Grace Blackwell, one of the fastest private supercomputers in the world, and run a fabric (InfiniBand, Spectrum-X, BlueField) almost no company has in-house.
Technical autonomy: Ownership over architecture decisions and the freedom to solve hard infrastructure problems your way.
Elite team: Join top engineers who thrive on hard problems and high-impact work.
Series B momentum, real ownership: Meaningful equity at a Series B company that's raised $145M, with proven product-market fit and Fortune 100 traction.

Why You Might Not Be Excited

If you only want to tell people what to build instead of building and automating alongside them, this isn't the environment for you.
You prefer companies with 100% built-out process for every detail.
You prefer static over dynamic; projects and priorities adapt as we grow.
We have real paying customers and a playbook, and we still move at startup speed at Series B scale.


Vacancy posted 1 day ago

