AI Infrastructure Operations Engineer
CEREBRAS SYSTEMS INC.
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.
Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras, to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.
Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.
You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and an advocate for customer success.
Responsibilities
- Manage and operate multiple advanced AI compute infrastructure clusters.
- Monitor and oversee cluster health, proactively identifying and resolving potential issues.
- Maximize compute capacity through optimization and efficient resource allocation.
- Deploy, configure, and debug container-based services using Docker.
- Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
- Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
- Contribute to the development and improvement of our monitoring and support processes.
- Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.
Qualifications
- 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
- Strong proficiency in Python scripting for automation and system administration.
- Deep understanding of Linux-based compute systems and command-line tools.
- Extensive knowledge of Docker containers and of orchestration and scheduling platforms such as Kubernetes (k8s) and Slurm.
- Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.
- Experience with monitoring and alerting systems.
- Proven track record of owning challenges and driving them to completion.
- Excellent communication and collaboration skills.
- Ability to work effectively in a fast-paced environment.
- Willingness to participate in a 24/7 on-call rotation.
Preferred Qualifications
- Experience operating large-scale GPU clusters.
- Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP is desired.
- Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
- Familiarity with machine learning frameworks and tools.
- Experience with cross-functional team projects.
Locations
- SF Bay Area.
- Toronto, Canada.
- Bangalore, India.
Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open source their cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Our simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2026.
Apply today and join the forefront of groundbreaking advancements in AI!
Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.


