AI Infrastructure Operations Engineer

CEREBRAS SYSTEMS INC.

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra-high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

About The Role

We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.

You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and a commitment to customer success.

Responsibilities

Manage and operate multiple advanced AI compute infrastructure clusters.
Monitor and oversee cluster health, proactively identifying and resolving potential issues.
Maximize compute capacity through optimization and efficient resource allocation.
Deploy, configure, and debug container-based services using Docker.
Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
Contribute to the development and improvement of our monitoring and support processes.
Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.

Skills And Requirements

6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
Strong proficiency in Python scripting for automation and system administration.
Deep understanding of Linux-based compute systems and command-line tools.
Extensive knowledge of Docker containers and of orchestration and scheduling platforms such as Kubernetes (k8s) and Slurm.
Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.
Experience with monitoring and alerting systems.
Proven track record of owning challenges and driving them to completion.
Excellent communication and collaboration skills.
Ability to work effectively in a fast-paced environment.
Willingness to participate in a 24/7 on-call rotation.

Preferred Skills And Requirements

Experience operating large-scale GPU clusters.
Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
Familiarity with machine learning frameworks and tools.
Experience with cross-functional team projects.

Location

SF Bay Area.
Toronto, Canada.
Bangalore, India.

Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Thrive in a simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.
Apply today and join the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.


Vacancy posted 4 days ago
