Kernel Reliability Engineering Manager

Dormont Manufacturing Co

Cerebras Systems builds the world’s largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras’ current customers include top model labs, global enterprises, and cutting-edge AI-native startups.

OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

The Role

We’re looking for a deeply technical, hands-on engineering leader for our on-field Kernel Reliability team. You will lead a high-performing team to tackle a critical challenge: improving the reliability of our advanced compute clusters and the underlying inference, training, and internal production services. In this role, you’ll set the technical vision while staying close to the code and designing solutions that will scale to our exponentially growing system production and software service offerings. If you have proven expertise in software or hardware reliability, diagnostic tool building, or failure analysis and debugging, we want to hear from you.
Responsibilities

- Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems.
- Assist System and Cluster Operations teams in reducing system and service downtime after failure by providing tooling and manual intervention for failure analysis and diagnostics.
- Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis.
- Collaborate with SW teams to improve the software stack, including Kernels, to improve on-field debugging and failure analysis.
- Work with the ASIC and HW architecture teams to co-design the next-generation architectures with reliability and ease of debug in mind.
- Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution.

Skills & Qualifications

- 6+ years in software engineering, with 3+ years leading teams in SW/HW reliability, debug, diagnostics, failure analysis, or related fields.
- Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.), debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.), experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.), and deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.).
- Operations & Monitoring: Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.).
- Leadership & Collaboration: Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products.

Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry.
With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open source their cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Our simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.

Apply today and become part of the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

Vacancy posted 4 days ago

Similar jobs that could be interesting for you
Based on the Kernel Reliability Engineering Manager in Sunnyvale, CA vacancy

Kernel Reliability Engineer — Debug High-Impact AI Compute
Cerebras Systems is seeking a deeply technical software engineer for its Kernel Reliability team in Sunnyvale, California. This role involves enhancing the reliability of advanced compute clusters. The ideal candidate will have strong programming skills in C/C++ and Python...
Suggested
Dormont Manufacturing Co
Sunnyvale, CA
4 days ago
SRE Engineering Manager: Lead Global Uptime & Reliability
$207k - $300k
A leading technology company in Sunnyvale, CA is seeking a Software Engineering Manager II for Site Reliability Engineering. You'll lead a team to ensure uptime and optimize the availability, scalability, and performance of key services. With a focus on automation and system...
Suggested
Google Inc.
Sunnyvale, CA
22 hours ago
Principal Engineer, CUDA UMD - GPU Kernel Scheduling
$272k - $431.25k
...we need to see BS or MS degree in Computer Science, Electrical Engineering or related field (or equivalent experience) Strong C and C++... ...Knowledge of memory coherence and consistency models Background with kernel mode development Experience with Linux Systems Software...
Suggested
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Manager, Site Reliability Engineering
$200k - $322k
...environment, where NVIDIANs are inspired to excel and make a profound global impact. NVIDIA is seeking a Senior Manager of Site Reliability Engineering to lead and reshape how IT operations function at scale. This role goes beyond traditional service management to build...
Suggested
NVIDIA
Santa Clara, CA
22 hours ago
Site Reliability Engineering Manager, Google Distributed Cloud
$207k - $300k
Site Reliability Engineering Manager, Google Distributed Cloud Google Sunnyvale, CA, USA Bachelor’s degree in Computer Science, a related field, or equivalent practical experience. 8 years of experience building or managing distributed systems or cloud infrastructure,...
Suggested
Full time
Google Inc.
Sunnyvale, CA
1 day ago
Principal Site Reliability Engineer
$202k - $247k
Job Category Site Reliability Engineering Posting Date 11/18/2025, 12:24 AM Locations Santa Clara, CA, United States Job Schedule Full time... ...spanning our cloud accounts, network/connectivity, workload management, observability, and storage services. We build tooling to...
Full time
Worldwide
Fortinet, Inc.
Santa Clara, CA
3 days ago
Principal Site Reliability Engineer ( U.S Citizenship required )
$151.6k - $245.3k
...Networks runs a large infrastructure and is one of the largest GCP customers. As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experience Management) team, you will be part of a team supporting the services that provide end-to-end visibility and...
Full time
Work at office
Visa sponsorship
Work visa
Palo Alto Networks, Inc.
Santa Clara, CA
4 days ago
Systems Engineer, Kernel
$165k - $242k
...Systems Engineer, Kernel Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA CoreWeave... ...our HAVOCK Team, reporting into the Manager of Systems Engineering. In this role,... ...that improves the performance and reliability of our stack. This position is ideal...
Permanent employment
Temporary work
Casual work
Work at office
Remote work
Flexible hours
CoreWeave
Sunnyvale, CA
3 days ago
Principal SRE Engineer (US Citizen)
$147k - $237.5k
...Qualifications Must be a US Citizen. BS/MS in Computer Science/Engineering or equivalent training, education, and experience in... ...Deep knowledge of cloud security architecture, vulnerability management, and networking protocols (IP Networking, VPNs, DNS, Load Balancing...
Full time
Work at office
Visa sponsorship
Work visa
Palo Alto Networks, Inc.
Santa Clara, CA
2 days ago
Principal Site Reliability Engineer (CIPE)
...role requires US Citizenship. Your Career As a Principal Site Reliability Engineer, you will serve as the technical authority for our cloud-... ...infrastructure‑as‑code model. Security Engineering : Implement and manage security scanning tools (Prisma Cloud, Snyk, or GKE native...
Visa sponsorship
Work visa
Shift work
Palo Alto Networks, Inc.
Santa Clara, CA
3 days ago
Manager, Software Engineering-Kernels
...Manager, Software Engineering-Kernels At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture...
Work experience placement
3 days per week
D-Matrix
Santa Clara, CA
3 days ago
Software Engineering Manager II, Site Reliability Engineering
$207k - $300k
Software Engineering Manager II, Site Reliability Engineering Google Sunnyvale, CA, USA Bachelor’s degree in Computer Science, a related field, or equivalent practical experience. 8 years of experience with software development in one or more programming...
Full time
Google Inc.
Sunnyvale, CA
22 hours ago
Site Reliability Manager, Site Reliability Engineering
$207k - $300k
Site Reliability Manager, Site Reliability Engineering Experience owning outcomes and decision making, solving ambiguous problems and influencing stakeholders; deep expertise in domain. Apply Qualifications Bachelor’s degree in Computer Science, a related field, or equivalent...
Full time
Google Inc.
Mountain View, CA
1 day ago
Senior OS/Kernel Engineer - Real-Time Automotive Systems
$143.2k - $186k
1600 NIO USA, Inc. is seeking a Senior OS / Kernel Engineer for the SkyOS team to design and develop full-domain vehicle operating systems. Candidates should have a strong background in operating system internals and proficiency in languages like C or Rust. The position...
1600 NIO USA, Inc.
San Jose, CA
4 days ago
Fleet Reliability Engineer
$110k - $150k
...recognize the importance of flexibility and trust our employees to manage their schedules responsibly. This may include occasional... ...accommodate family commitments. About the role As a Fleet Reliability Engineer at Applied Intuition, you will play a critical role in...
Full time
For contractors
For subcontractor
Casual work
Work at office
Remote work
Day shift
Applied Intuition
Sunnyvale, CA
1 day ago
Senior IC Packaging Reliability Engineer - 2.5D/3D & BGA
NVIDIA Gruppe is seeking an experienced professional to lead package-level reliability for semiconductor products in Santa Clara, California. The ideal candidate will possess a Master’s or PhD in a related field, along with 8+ years of hands-on experience in IC packaging...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Fleet Reliability Engineer — Automotive HW/SW & Testing
$110k - $150k
Applied Intuition is looking for a Fleet Reliability Engineer in Sunnyvale, California. You will enhance vehicle fleet reliability by optimizing hardware and software systems. Ideal candidates have 5+ years of automotive experience and a relevant degree. Compensation includes...
Decisive Point
Sunnyvale, CA
2 days ago
Principal Cloud SRE & Automation Engineer
Palo Alto Networks, Inc. is seeking a Principal Site Reliability Engineer in Santa Clara, CA. This role involves supporting a large infrastructure and ensuring applications are production-ready, scalable, and reliable. You'll work closely with developers and researchers...
Palo Alto Networks, Inc.
Santa Clara, CA
3 days ago
Senior Automotive Reliability & EMC Engineer - DV/PV Lead
$125.8k - $170.2k
A technology firm specializing in autonomous vehicle solutions is seeking a Senior Automotive Reliability & EMC Test Engineer to lead validation and compliance testing for LiDAR systems. The role involves defining testing strategies, leading environmental tests, and ensuring...
Aeva, Inc.
Mountain View, CA
2 days ago
Senior Automotive Reliability & EMC Test engineer
$125.8k - $170.2k
...ground up at silicon photonics scale for mass‑market applications. Role Overview Aeva is seeking a Senior Automotive Reliability & EMC Test Engineer to own validation and compliance testing for automotive LiDAR systems through DV, PV, and production readiness. This role...
Flexible hours
Clutch Canada
Mountain View, CA
1 day ago
Founding Security Reliability Engineer
$150k - $250k
...As our Founding Security Reliability Engineer at Charta Health, you'll pioneer the application of Site Reliability Engineering principles... ...their billing needs and highly sensitive data are expertly managed and continuously protected through robust security reliability...
Charta Health
Sunnyvale, CA
4 days ago
Linux Kernel Engineer - Enterprise Storage Systems
A leading technology company is seeking a Linux Kernel Software Engineer to develop and optimize the Linux kernel for enterprise storage solutions. This role requires deep experience in kernel development and a strong foundation in computer systems. You will collaborate...
Pure Storage, Inc.
Santa Clara, CA
1 day ago
Senior Linux Kernel Systems Engineer for CSP Deployments
NVIDIA Gruppe is seeking a Senior Software Engineer to work on system software for datacenter products in Santa Clara, California. This... ...will have over 10 years of experience, a strong grasp on Linux kernel internals, and expertise in data center architectures. Notably,...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Module Reliability Engineer
$147.4k - $272.1k
...challenges? This position is offered in Apple’s Hardware Module Reliability Group. We guide development teams toward generating reliable... ...modes early in the design life cycle and by using our engineering knowledge to help find the best path towards mitigation. Here...
Remote work
Relocation
Overseas
Apple Inc.
Cupertino, CA
3 days ago
Display & Touch Module Reliability Engineer
$147.4k - $272.1k
Apple Inc. is looking for a Hardware Reliability Engineer for its Cupertino location. The role involves setting test specs, conducting reliability assessments, and analyzing data to guide the design process. The ideal candidate should have a bachelor’s degree in a relevant...
Apple Inc.
Cupertino, CA
3 days ago
Silicon Reliability Engineer
$159k - $231k
...Apply Bachelor's degree in Electrical Engineering, Material Science, Mechanical Engineering... ...Circuit (ASIC) device and package reliability Joint Electron Device Engineering Council... ...You will work closely with the product management and design engineering teams to define...
Full time
Google Inc.
Mountain View, CA
4 days ago
Senior Reliability Engineer - LPU Packaging
What You’ll Be Doing Own the package‑level reliability spec for assigned products Define qualification requirements and pass/fail criteria... ...requirement What We Need to See MS/PhD in Electrical Engineering, Materials Science, Mechanical Engineering, or related field,...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Silicon Reliability Engineer
$168k - $264.5k
...human inventiveness and intelligence. Make the choice to join us today. We are seeking an outstanding candidate for Silicon Reliability Engineer who will serve as the foundry process reliability professional for NVIDIA in utilizing cutting edge technologies to deliver...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Reliability Engineer
$136k - $218.5k
...innovator in computer graphics, PC gaming, and accelerated computing, as we step into the next era shaped by AI. As a Senior Reliability Engineer, you'll work within a focused team, developing groundbreaking products that form the future. This unique position provides...
NVIDIA Corporation
Santa Clara, CA
4 days ago
Senior Silicon Reliability Engineer — Wearout & Tests
$168k - $264.5k
NVIDIA Gruppe is seeking a Silicon Reliability Engineer to lead foundry process reliability for cutting-edge technologies. You will collaborate with teams to ensure high-performance products and develop reliability methodologies. The ideal candidate will have a PhD or...
NVIDIA Gruppe
Santa Clara, CA
4 days ago
