
Kernel Reliability Engineering Manager

Dormont Manufacturing Co

Cerebras Systems builds the world’s largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras’ current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

The Role

We’re looking for a deeply technical, hands-on engineering leader for our on-field Kernel Reliability team. You will lead a high-performing team to tackle a critical challenge: improving the reliability of our advanced compute clusters and the underlying inference, training, and internal production services. In this role, you’ll set the technical vision while staying close to the code and designing solutions that will scale to our exponentially growing system production and software service offerings. If you have proven expertise in software or hardware reliability, diagnostic tool building, or failure analysis and debugging, we want to hear from you.

Responsibilities

  • Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
  • Assist System and Cluster Operations teams in reducing system and service downtime after failure by providing tooling and manual intervention for failure analysis and diagnostics
  • Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
  • Collaborate with SW teams to improve the software stack, including Kernels, to improve on-field debugging and failure analysis
  • Work with the ASIC and HW architecture teams to codesign the next generation architectures with reliability and ease of debug in mind
  • Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution

Skills & Qualifications

  • 6+ years in software engineering, with 3+ years leading teams in SW/HW reliability, debug, diagnostics, failure analysis, or related fields
  • Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.), debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.), experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.), deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.)
  • Operations & Monitoring: Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.)
  • Leadership & Collaboration: Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products

Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry.
With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

  • Build a breakthrough AI platform beyond the constraints of the GPU.
  • Publish and open source their cutting-edge AI research.
  • Work on one of the fastest AI supercomputers in the world.
  • Enjoy job stability with startup vitality.
  • Our simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2026.

Apply today and become part of the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

Vacancy posted 4 days ago
