
Staff Site Reliability Engineer - Automation and Platform

Cerebras Systems, Inc.

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. This architecture allows Cerebras to deliver industry-leading training and inference speeds, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Cerebras works with the leading model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

About the Role

We are building a high-performance SRE function to support one of the world's fastest-growing AI inference services, powered by the Wafer-Scale Engine (WSE). This team will help deliver world-class, ultra-reliable inference infrastructure for leading model builders such as OpenAI and other frontier labs.

As a Staff SRE, you will lead the engineering effort to eliminate toil at scale by driving implementation of self-service delivery pipelines and shared observability tooling. This role starts with approximately one month of hands-on operational immersion to gain deep familiarity with our current stack, production pain points, and high-stakes workflows. From there, your primary focus shifts to architecting and delivering the "tomorrow" layer: declarative GitOps-driven continuous delivery for model releases, capacity provisioning, and cluster upgrades.

Success in this role will be defined by enabling core teams, product managers, external customers, and cluster stakeholders to operate in a fully self-service model with strong reliability guarantees. If you are a proven Staff+ engineer who enjoys turning complexity into elegant reliability at scale, this is your chance to lead this transformation from the front.

This role does not require 24/7 on-call rotations.

Responsibilities

  • Define and implement a robust strategy for delivering and running software reliably and at scale across multiple datacenters and cloud-based solutions.
  • Architect self-service platforms and internal tooling that let product teams, external customers, and cluster operators safely trigger and observe critical workflows with minimal handoffs.
  • Define and evolve reliability practices for inference workloads, including SLOs and SLIs for latency, throughput, and accuracy stability; error budgets; blameless postmortems; chaos testing; and capacity forecasting across multi-datacenter and on-prem environments.
  • Mentor mid-level SREs, support critical incident escalations, and use production pain points to prioritize the highest-leverage automation work.
  • Measure and drive impact through clear metrics, including toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self-service workflows.
  • Partner with our early-career SRE sub-team, who own day-to-day operations, to understand their pain points, automate operational toil, and mentor them as platform engineers.
  • Collaborate with tech leads and leadership across core, cluster, cloud, and product teams to shift reliability from an operations-only burden to a shared engineering discipline that underpins frontier AI inference at scale.

Skills & Qualifications

  • 8+ years of experience in SRE, infrastructure engineering, or platform engineering, with a strong record of improving automation and reliability at large scale in FAANG, hyperscaler, or similarly demanding environments.
  • Deep expertise operating large-scale heterogeneous clusters with a proprietary cloud control plane.
  • Proven track record designing and delivering CI/CD or GitOps systems using Argo CD or similar tools, with strong safety and observability built in.
  • Hands-on experience with observability systems such as Loki, Tempo, Mimir, and Prometheus.
  • Ability to lead complex projects end-to-end, influence cross-functional stakeholders, and communicate technical direction clearly.

Preferred Skills & Qualifications

  • Experience with Bazel or other large-scale build systems in production.
  • Background in AI/ML inference systems, including model serving runtimes, GPU or wafer-scale orchestration, latency and accuracy SLOs, or drift monitoring.
  • Prior work on predictive autoscaling, chaos engineering, or cost-aware capacity planning for compute-intensive workloads.

Location

  • SF Bay Area
  • Toronto

Why Join Cerebras

  • Build a breakthrough AI platform beyond the constraints of the GPU.
  • Publish and open-source cutting-edge AI research.
  • Work on one of the fastest AI supercomputers in the world.
  • Enjoy job stability with startup vitality.
  • Be part of a simple, non-corporate work culture that respects individual beliefs.

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth, and support of those around them.

Vacancy posted 5 hours ago
