Senior Network & Site Reliability Engineer

Alembic, Inc.

About Us

Alembic is the pioneering Causal AI platform. We help the world's largest enterprises move past correlation to prove what actually drives business outcomes: the question marketing and growth teams have never been able to answer with confidence. Fortune 100 companies including Nvidia, Delta Air Lines, and Mars use Alembic to make multimillion-dollar decisions on trusted, causal evidence. We're backed by a $145M Series B from WndrCo (founded by Jeffrey Katzenberg), Jensen Huang, Joe Montana, Prysm Capital, and Accenture. Our models run on our own NVIDIA DGX SuperPOD built on Grace Blackwell infrastructure, one of the fastest private supercomputers in the world. (We've melted GPUs getting here.)

About the Role

We're building infrastructure that has to perform under real-world scale, reliability, and security demands, and we're looking for an engineer who wants to own the foundation it runs on.

This isn't a traditional "keep the lights on" role. You will design and operate the global network and reliability layer behind one of the world's fastest private supercomputers: the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.

It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.

What You'll Do

  • Architect and operate scalable, secure network architecture for high-security requirements and large-scale machine learning workloads.
  • Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
  • Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
  • Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.
  • Build and maintain comprehensive monitoring, alerting, and incident response (SLOs, runbooks, and on-call rotations), and drive post-incident analysis and continuous improvement.
  • Ensure security, compliance, and operational readiness across our network and cloud infrastructure.
  • Partner across engineering and data science to drive a culture of performance and reliability.

What Will Help You Succeed

  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
  • A strong background in network security, architecture, design, and operations.
  • Extensive hands-on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols: BGP, QoS, MPLS, and IPsec VPNs.
  • Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).
  • Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).
  • WAN engineering: carrier circuit provisioning and external network peering.
  • Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.
  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
  • Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross-functional communication.

Also Helpful

  • NVIDIA networking technologies: Cumulus Linux, InfiniBand, Spectrum-X, and BlueField DPUs (this is the fabric behind our SuperPOD).
  • Familiarity with data-intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).
  • Security practices for applications and infrastructure, and experience in high-compliance or SOC 2 environments.

The Role Is Right for You If

You want to own mission-critical network and infrastructure end to end, from architecture to incident management, not just keep it running. You'd rather build and automate than direct from a distance, and you want meaningful influence over how a high-performance platform scales.

Why You Might Be Excited About Alembic

  • Hard problems with real impact: You'll own the network and reliability layer behind systems that influence multimillion-dollar decisions at Fortune 100 companies.
  • Cutting-edge technology: Operate our own NVIDIA DGX SuperPOD on Grace Blackwell, one of the fastest private supercomputers in the world, and run a fabric (InfiniBand, Spectrum-X, BlueField) almost no company has in-house.
  • Technical autonomy: Ownership over architecture decisions and the freedom to solve hard infrastructure problems your way.
  • Elite team: Join top engineers who thrive on hard problems and high-impact work.
  • Series B momentum, real ownership: Meaningful equity at a Series B company that's raised $145M, with proven product-market fit and Fortune 100 traction.

Why You Might Not Be Excited

  • If you only want to tell people what to build instead of building and automating alongside them, this isn't the environment for you.
  • You prefer companies with 100% built-out process for every detail.
  • You prefer static over dynamic; projects and priorities adapt as we grow.

We have real paying customers and a playbook, and we still move at startup speed at Series B scale.

Vacancy posted 1 day ago