Kernel Reliability Engineering Manager
Dormont Manufacturing Co
Cerebras Systems builds the world’s largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras’ current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

The Role
We’re looking for a deeply technical, hands-on engineering leader for our on-field Kernel Reliability team. You will lead a high-performing team to tackle a critical challenge: improving the reliability of our advanced compute clusters and the underlying inference, training, and internal production services. In this role, you’ll set the technical vision while staying close to the code and designing solutions that will scale to our exponentially growing system production and software service offerings. If you have proven expertise in software or hardware reliability, diagnostic tool building, or failure analysis and debugging, we want to hear from you.

Responsibilities
- Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
- Assist System and Cluster Operations teams in reducing system and service downtime after failures by providing tooling and manual intervention for failure analysis and diagnostics
- Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
- Collaborate with SW teams to improve the software stack, including Kernels, for better on-field debugging and failure analysis
- Work with the ASIC and HW architecture teams to co-design next-generation architectures with reliability and ease of debug in mind
- Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution

Skills & Qualifications
- 6+ years in software engineering, with 3+ years leading teams in SW/HW reliability, debug, diagnostics, failure analysis, or related fields
- Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.); debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.); experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.); and a deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.)
- Operations & Monitoring: Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.)
- Leadership & Collaboration: Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products

Why Join Cerebras
People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry.
With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:
- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open source their cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Our simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.

Apply today and become part of the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

