Staff Site Reliability Engineer - Automation and Platform
Cerebras Systems, Inc.
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. This architecture allows Cerebras to deliver industry-leading training and inference speeds, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Cerebras works with the leading model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

About the Role

We are building a high-performance SRE function to support one of the world's fastest-growing AI inference services, powered by the Wafer-Scale Engine (WSE). This team will help deliver world-class, ultra-reliable inference infrastructure for leading model builders such as OpenAI and other frontier labs.

As a Staff SRE, you will lead the engineering effort to eliminate toil at scale by driving implementation of self-service delivery pipelines and shared observability tooling. This role starts with approximately one month of hands-on operational immersion to gain deep familiarity with our current stack, production pain points, and high-stakes workflows. From there, your primary focus shifts to architecting and delivering the "tomorrow" layer: declarative GitOps-driven continuous delivery for model releases, capacity provisioning, and cluster upgrades.

Success in this role will be defined by enabling core teams, product managers, external customers, and cluster stakeholders to operate in a fully self-service model with strong reliability guarantees. If you are a proven Staff+ engineer who enjoys turning complexity into elegant reliability at scale, this is your chance to lead this transformation from the front.

This role does not require 24/7 on-call rotations.

Responsibilities

- Define and implement a robust strategy for delivering and running software reliably and at scale across multiple datacenters and cloud-based solutions.
- Architect self-service platforms and internal tooling that let product teams, external customers, and cluster operators safely trigger and observe critical workflows with minimal handoffs.
- Define and evolve reliability practices for inference workloads, including SLOs and SLIs for latency, throughput, and accuracy stability; error budgets; blameless postmortems; chaos testing; and capacity forecasting across multi-datacenter and on-prem environments.
- Mentor mid-level SREs, support critical incident escalations, and use production pain points to prioritize the highest-leverage automation work.
- Measure and drive impact through clear metrics, including toil reduction, deployment velocity, SLO compliance, MTTR, and adoption of self-service workflows.
- Partner with our early-career SRE sub-team, who own day-to-day operations, to understand their pain points, automate operational toil, and mentor them as platform engineers.
- Collaborate with tech leads and leadership across core, cluster, cloud, and product teams to shift reliability from an operations-only burden to a shared engineering discipline that underpins frontier AI inference at scale.

Skills & Qualifications

- 8+ years of experience in SRE, infrastructure engineering, or platform engineering, with a strong record of improving automation and reliability at large scale in FAANG, hyperscaler, or similarly demanding environments.
- Deep expertise operating large-scale heterogeneous clusters with a proprietary cloud control plane.
- Proven track record designing and delivering CI/CD or GitOps systems using Argo CD or similar tools, with strong safety and observability built in.
- Hands-on experience with observability systems such as Loki, Tempo, Mimir, and Prometheus.
- Ability to lead complex projects end-to-end, influence cross-functional stakeholders, and communicate technical direction clearly.

Preferred Skills & Qualifications

- Experience with Bazel or other large-scale build systems in production.
- Background in AI/ML inference systems, including model serving runtimes, GPU or wafer-scale orchestration, latency and accuracy SLOs, or drift monitoring.
- Prior work on predictive autoscaling, chaos engineering, or cost-aware capacity planning for compute-intensive workloads.

Location

- SF Bay Area
- Toronto

Why Join Cerebras

- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open source their cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Our simple, non-corporate work culture that respects individual beliefs.

Find out more about what it's like to work at Cerebras.

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

