
AI Infrastructure Operations Engineer

CEREBRAS SYSTEMS INC.

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.


Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra-high-speed inference.


Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

About The Role

We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. These clusters give the candidate an opportunity to work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.


You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience with monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and an advocate for customer success.
Responsibilities
  • Manage and operate multiple advanced AI compute infrastructure clusters.
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues.
  • Maximize compute capacity through optimization and efficient resource allocation.
  • Deploy, configure, and debug container-based services using Docker.
  • Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
  • Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
  • Contribute to the development and improvement of our monitoring and support processes.
  • Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.
Skills And Requirements
  • 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
  • Strong proficiency in Python scripting for automation and system administration.
  • Deep understanding of Linux-based compute systems and command-line tools.
  • Extensive knowledge of Docker containers, container orchestration platforms like Kubernetes (k8s), and workload managers like Slurm.
  • Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.
  • Experience with monitoring and alerting systems.
  • Proven track record of owning challenges and driving them to completion.
  • Excellent communication and collaboration skills.
  • Ability to work effectively in a fast-paced environment.
  • Willingness to participate in a 24/7 on-call rotation.
Preferred Skills And Requirements
  • Experience operating large-scale GPU clusters.
  • Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
  • Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
  • Familiarity with machine learning frameworks and tools.
  • Experience with cross-functional team projects.
Location
  • SF Bay Area.
  • Toronto, Canada.
  • Bangalore, India.

Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
  1. Build a breakthrough AI platform beyond the constraints of the GPU.
  2. Publish and open source their cutting-edge AI research.
  3. Work on one of the fastest AI supercomputers in the world.
  4. Enjoy job stability with startup vitality.
  5. Work in a simple, non-corporate culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.
Apply today and join the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

Vacancy posted 4 days ago
