Staff Site Reliability Engineer - Automation and Platform

Cerebras Systems, Inc.

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. This architecture allows Cerebras to deliver industry-leading training and inference speeds, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation. Cerebras works with the leading model labs, global enterprises, and cutting‑edge AI‑native startups. OpenAI recently announced a multi‑year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high‑speed inference.

About the Role

We are building a high‑performance SRE function to support one of the world’s fastest‑growing AI inference services, powered by the Wafer‑Scale Engine (WSE). This team will help deliver world‑class, ultra‑reliable inference infrastructure for leading model builders such as OpenAI and other frontier labs.

As a Staff SRE, you will lead the engineering effort to eliminate toil at scale by driving implementation of self‑service delivery pipelines and shared observability tooling. This role starts with approximately one month of hands‑on operational immersion to gain deep familiarity with our current stack, production pain points, and high‑stakes workflows. From there, your primary focus shifts to architecting and delivering the "tomorrow" layer: declarative GitOps‑driven continuous delivery for model releases, capacity provisioning, and cluster upgrades.

Success in this role will be defined by enabling core teams, product managers, external customers, and cluster stakeholders to operate in a fully self‑service model with strong reliability guarantees. If you are a proven Staff+ engineer who enjoys turning complexity into elegant reliability at scale, this is your chance to lead this transformation from the front. This role does not require 24/7 on‑call rotations.

Responsibilities

Define and implement a robust strategy for delivering and running software reliably and at scale across multiple datacenters and cloud‑based solutions.
Architect self‑service platforms and internal tooling that let product teams, external customers, and cluster operators safely trigger and observe critical workflows with minimal handoffs.
Define and evolve reliability practices for inference workloads, including SLOs and SLIs for latency, throughput, and accuracy stability; error budgets; blameless postmortems; chaos testing; and capacity forecasting across multi‑datacenter and on‑prem environments.
Mentor mid‑level SREs, support critical incident escalations, and use production pain points to prioritize the highest‑leverage automation work.
Measure and drive impact through clear metrics, including toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self‑service workflows.
Partner with our early‑career SRE sub‑team, who own day‑to‑day operations, to understand their pain points, automate operational toil, and mentor them as platform engineers.
Collaborate with tech leads and leadership across core, cluster, cloud, and product teams to shift reliability from an operations‑only burden to a shared engineering discipline that underpins frontier AI inference at scale.

Skills & Qualifications

8+ years of experience in SRE, infrastructure engineering, or platform engineering, with a strong record of improving automation and reliability at large scale in FAANG, hyperscaler, or similarly demanding environments.
Deep expertise operating large‑scale heterogeneous clusters with a proprietary cloud control plane.
Proven track record designing and delivering CI/CD or GitOps systems using Argo CD or similar tools, with strong safety and observability built in.
Hands‑on experience with observability systems such as Loki, Tempo, Mimir, and Prometheus.
Ability to lead complex projects end‑to‑end, influence cross‑functional stakeholders, and communicate technical direction clearly.

Preferred Skills & Qualifications

Experience with Bazel or other large‑scale build systems in production.
Background in AI/ML inference systems, including model serving runtimes, GPU or wafer‑scale orchestration, latency and accuracy SLOs, or drift monitoring.
Prior work on predictive autoscaling, chaos engineering, or cost‑aware capacity planning for compute‑intensive workloads.

Location

SF Bay Area
Toronto

Why Join Cerebras

Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open source your cutting‑edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Enjoy a simple, non‑corporate work culture that respects individual beliefs.

Find out more about what it's like to work at Cerebras.

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

Vacancy posted 5 hours ago

Similar jobs that could be interesting for you
Based on the Staff Site Reliability Engineer - Automation and Platform in Sunnyvale, CA vacancy

Staff TLM — Enterprise AI & Automation Engineering
Aurora is seeking a Staff Technical Lead Manager (TLM) to build and lead the Corporate Engineering team in Mountain View, California. This role entails developing the strategy... ...in managing teams, integrating SaaS platforms, and a robust understanding of AI solutions....
Platform
I did my part and supported the Regular Toilet
Mountain View, CA
2 days ago
Site Reliability Engineer (SRE)
...Role: Site Reliability Engineer (SRE) Location: Santa Clara Valley (Cupertino), California... ...and monitor new and existing services, platforms, and application stacks. Use... ...infrastructure and applications through automation. Participate in periodic on-call...
Platform
Zortech Solutions
Santa Clara, CA
16 hours ago
Site Reliability Engineer
$150k
...S23 - Telecommunications Role - Site Reliability Engineer Location: Santa Clara, CA or Wall... ...operating a modern Kubernetes-based platform that powers highly scalable, secure,... ...focused on improving platform reliability, automation, observability, and developer...
Platform
S23 Recruitment
Santa Clara, CA
7 hours ago
Sr. Site Reliability Engineer
...our breach containment platform identifies and... ...running. Location: 5 on-site days a week in Sunnyvale... ...Team's Vision: Our Engineering team is shaping the... ...experienced Senior Site Reliability Engineer (SRE) with a... ...along with a passion for automation, continuous...
Platform
Work experience placement
Immediate start
Illumio
Sunnyvale, CA
3 days ago
Staff Site Reliability Engineer
$217.57k - $260k
...otherwise, all roles are on-site five days per week at one... ...Role Overview The Staff Site Reliability Engineer, Infrastructure role is building... ...microservices and platform/API architectures. Our... ...tooling, and infrastructure automation. Experience at large-scale...
Platform
Full time
Temporary work
Work at office
Remote work
Flexible hours
Shift work
ID.me
Mountain View, CA
1 day ago
Lead Site Reliability Engineer
$200k - $260k
...Lead Site Reliability Engineer Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry's most advanced enterprise... ...powers Glean's agentic capabilities - AI agents that automate real work across teams by accessing the industry's...
Platform
Work at office
Home office
Flexible hours
Softbank Investment Advisers
Mountain View, CA
1 day ago
Site Reliability Engineer
Site Reliability Engineer Onsite- Bay Area, CA Skills Relevant Skills and Experience What You’ll Do (Day-to-Day) Own and manage our cloud infrastructure... ...Ensure high availability, reliability, and uptime across platforms. Handle infrastructure maintenance, upgrades, and scaling...
Platform
Amiri Recruiting
Santa Clara, CA
16 hours ago
Sr. Site Reliability Engineer
...our breach containment platform identifies and... ...running. Location: 5 on-site days a week in Sunnyvale... ...Our Team's Vision: Our Engineering team is shaping the future... ...Senior Site Reliability Engineer (SRE) with a... ...along with a passion for automation, continuous improvement...
Platform
Work experience placement
Illumio
Sunnyvale, CA
3 days ago
Senior Site Reliability Engineer, AIOPs
...Role Overview You will be building an AI Data Center AIOps platform that turns raw, high‑volume telemetry into reliable, job‑centric insights and automation for GPU fleets. Join our team of innovative engineers who are building this platform and operating it (not the compute...
Platform
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Site Reliability Engineer - $150,000-$195,000 Fortinet
$150k - $195k
...customers worldwide. The Role Automate as much as reasonable to... ...operational efficiency of the Lacework platform Design, build and improve our... ...best practices alongside engineering/operations teams to improve the scalability and reliability of internal processes. Participate...
Platform
Full time
Worldwide
Isc2 Eastbay Chapter
Sunnyvale, CA
16 hours ago
Site Reliability Engineer
$145k - $165k
...Graphics is seeking a highly experienced Site Reliability Engineer (SRE) to design, build, and operate... ...Linux expertise and advanced automation capabilities are mandatory for success... ...Hands‑on experience with virtualization platforms including Proxmox (current), VMware vSphere...
Platform
Work at office
Bolt Graphics
Sunnyvale, CA
16 hours ago
Site Reliability Engineer, Enterprise Technology Services
$147.4k - $272.1k
Site Reliability Engineer, Enterprise Technology Services Sunnyvale, California, United States Software... ...groundbreaking, world-changing platforms and services. Our ETS applications play... ...working closely with application teams to automate operations, optimize infrastructure,...
Platform
Relocation
Apple Inc.
Sunnyvale, CA
2 days ago
Senior Site Reliability Engineer - HPC
$152k - $241.5k
...generation of our global services platform. At NVIDIA, you’ll keep... ...to standardize and automate provisioning everywhere. Deliver... ...lifecycle management, fleet reliability/auto‑healing, E2E observability... ...Perl, or Ruby. Mentored other engineers and influenced technical...
Platform
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Site Reliability Engineer — Human Engineering
$147.4k - $272.1k
...We are a team of software engineers developing web-based tools... ...every day. We’re looking for a Site Reliability Engineer who thinks like a... ...— you’ll shape how our platform evolves. Our team operates... ...correctly, build self‑service automation, evolve our observability and...
Platform
Relocation
Shift work
Apple Inc.
Cupertino, CA
3 days ago
Site Reliability Engineer: Platform & Observability
A leading technology company is seeking a Site Reliability Engineer in Cupertino, California. The role involves owning the reliability of AWS... ...collaborating with engineering teams for observability and automation. Candidates should have substantial experience with distributed...
Platform
Apple Inc.
Cupertino, CA
4 days ago
Senior Site Reliability Engineer / DevOps Engineer
...Europe Role Overview Seeking a Senior Site Reliability Engineer / DevOps Engineer to design, scale,... ...from operating Kubernetes and cloud platforms at scale. The ideal candidate has deep... ...Reduce operational toil through automation Why This Role You’ll own foundational...
Platform
Prophet Town
Mountain View, CA
1 day ago
Site Reliability Engineer - Hardware Infrastructure
At NVIDIA, Site Reliability Engineering provides a rare chance to define, develop, and support large-... ...centric monitoring and alerting. Apply automation and Generative AI/Agentic solutions... ...‑on experience with observability platforms (e.g., Prometheus, Grafana). Strong...
Platform
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Director, Site Reliability Engineering Sunnyvale, CA , USA
$250k
...source of truth—explainable, reliable, and maintainable—that... ...scalable and effective AI automation of business operations, with... ...Overview As Director of Site Reliability Engineering, you will ensure that eGain... ...s AI knowledge management platform operates with the reliability...
Platform
Work at office
eGain Corporation
Sunnyvale, CA
4 days ago
Senior Manager, Site Reliability Engineering
$200k - $322k
Senior Manager, Site Reliability Engineering... ...Management into an intelligent, automated operating model using observability,... ...of automation and orchestration platforms that reduce manual effort across the...
Platform
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior/Staff Site Reliability Engineer
$180k - $260k
...We are seeking an experienced Senior/Staff Site Reliability Engineer to support the operation, monitoring... ...closely with our infrastructure and platform teams to manage rollouts of both on‑... ...‑facing infrastructure solutions. Automate deployment, scaling, and upgrading of...
Platform
Odd job
Work at office
Remote work
Booster
Mountain View, CA
4 days ago
Senior Staff Software Engineer, Site Reliability Engineering
Senior Staff Software Engineer, Site Reliability Engineering In accordance with Washington state law, we are highlighting... ...consulting, developing software platforms and frameworks, capacity planning... ...through mechanisms like automation, and evolve systems by pushing for...
Platform
Temporary work
Google Inc.
Sunnyvale, CA
3 days ago
Senior Staff Site Reliability Engineer
$126k - $204.5k
...industry’s most advanced SecOps platform, consisting of XDR, XSIAM,... ...closely with our engineering teams to develop innovative... ...minimal impact on services. Automate complex monitoring and alerting... ...the product and ensure the reliability and availability of our services...
Platform
Palo Alto Networks, Inc.
Santa Clara, CA
2 days ago
Staff Software Engineer, Site Reliability Engineering
$207k - $300k
Staff Software Engineer, Site Reliability Engineering Google Sunnyvale, CA, USA Bachelor... ...and eliminating work through automation. On the SRE team, you’ll have the... ...the next generation of Google platforms, we make Google's product portfolio...
Platform
Full time
Google Inc.
Sunnyvale, CA
16 hours ago
Staff Site Reliability Engineer
$150k - $180k
...looking for candidates to serve as software-focused Senior Site Reliability Engineer at Verrus. This is a full‑time position based out of... ...operations as a software problem: building tooling, automation, platforms and integrations that allow Verrus to deploy and manage...
Platform
Full time
Work at office
Local area
Flexible hours
Verrus, LLC
Mountain View, CA
3 days ago
Principal Site Reliability Engineer
$202k - $247k
Job Category Site Reliability Engineering Posting Date 11/18/2025, 12:24 AM Locations Santa Clara, CA, United States Job... ...are looking for engineers with passion for automation. You will help support the FortiCNAPP platform and play a key role in building, operating, and...
Platform
Full time
Worldwide
Fortinet, Inc.
Santa Clara, CA
1 day ago
Principal Site Reliability Engineer ( U.S Citizenship required )
$151.6k - $245.3k
...largest GCP customers. As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital... ...our global customers. This includes automation, architecture, performance, observability... ...and reliability. Our Infrastructure Platform stack includes Terraform, Kubernetes,...
Platform
Full time
Work at office
Visa sponsorship
Work visa
Palo Alto Networks, Inc.
Santa Clara, CA
2 days ago
Senior Site Reliability Engineer: Cloud, Kubernetes & CI/CD
A leading tech recruiting firm is seeking a Site Reliability Engineer to manage and optimize cloud infrastructure primarily using GCP or AWS.... ...requires no customer interaction and focuses on improving platform architecture and reliability.
Platform
Amiri Recruiting
Mountain View, CA
4 days ago
Senior Site Reliability Engineer - Observability and Telemetry Platform
$176k - $276k
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain... ...on eliminating manual work through automation, performance tuning and growing... ...Observability & Telemetry collection platform with a focus on performance at scale...
Platform
NVIDIA Corporation
Santa Clara, CA
16 hours ago
Senior Site Reliability Engineer, Storage SRE / Apple Services Engineering
$181.1k - $318.4k
Senior Site Reliability Engineer, Storage SRE / Apple Services Engineering Cupertino, California, United... ...and technical lead in our Apple Data Platform (ADP) SRE organization, you will... ...processes, developing shared tooling and automation, and ensuring that SRE principles are...
Platform
Relocation
Apple Inc.
Cupertino, CA
1 day ago
Software Engineering Manager, Site Reliability Engineering
$207k - $301k
Software Engineering Manager, Site Reliability Engineering Sunnyvale, CA, USA Qualifications Bachelor’s degree in Computer Science, a related field... ...data correctness, and restore integrity after complex platform incidents. Directing the resource headroom and efficiency...
Platform
Google Inc.
Sunnyvale, CA
1 day ago
