Senior Cloud Platform Engineer

SambaNova Systems

San Jose, California, United States

The era of pervasive AI has arrived. In this era, organizations will use generative AI to unlock hidden value in their data, accelerate processes, reduce costs, and drive efficiency and innovation, fundamentally transforming their businesses and operations at scale.

SambaNova Suite™ is the first full-stack, generative AI platform, from chip to model, optimized for enterprise and government organizations. Powered by the intelligent SN40L chip, the SambaNova Suite is a fully integrated platform, delivered on-premises or in the cloud, combined with state-of-the-art open-source models that can be easily and securely fine-tuned using customer data for greater accuracy. Once a model is adapted with customer data, customers retain ownership of it in perpetuity, so they can turn generative AI into one of their most valuable assets.

About SambaNova Systems

Join the company that's building the future of AI computing. At SambaNova, we are disrupting the AI and high-performance computing space with our integrated hardware and software platform. Our DataScale systems and SambaFlow software are pushing the boundaries of what's possible with generative AI and large language models. We are a team of passionate innovators tackling some of the world's most challenging computational problems.

The Role

As a Senior Cloud Site Reliability Engineer (SRE) specializing in our AI Inferencing Service, you will be the guardian of its reliability, performance, and scalability. You will bridge the gap between software development and operations, applying an engineering mindset to solve operational challenges. Your primary focus will be ensuring our inference endpoints have exceptional uptime, low-latency response times, and efficient resource utilization, directly impacting the experience of our customers and the success of our AI products. This role includes participating in a shared on-call rotation to maintain 24/7 service reliability.

What You'll Do

Service Ownership & On-Call: Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions. This includes implementing and supporting AI infrastructure in new regions, such as Asia, Europe, and Latin America, to support the growth of our business. Participate in a balanced on-call rotation to provide 24/7 support for the service.
Incident Management: Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence.
Monitoring & Alerting: Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization. A key responsibility is ensuring alerts are actionable and have a low false-positive rate, minimizing on-call fatigue.
Performance & Scalability: Proactively identify and eliminate performance bottlenecks. Design and implement auto-scaling policies to handle variable inference loads cost-effectively. Use insights from on-call incidents to drive improvements that enhance system stability and scalability.
Infrastructure as Code (IaC): Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable.
CI/CD & Automation: Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates. A core goal is to automate manual toil identified during on-call shifts, reducing future operational overhead.
Capacity Planning: Forecast infrastructure needs based on product roadmaps and usage trends. Work with finance and engineering teams to manage cloud costs and optimize spending.
SLOs & SLIs: Define, measure, and report on Service Level Objectives (SLOs) and Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments.

On-Call & Work-Life Balance

We believe a sustainable on-call schedule is critical for long-term success and team health. Our on-call philosophy is built on the following principles:

Balanced Rotation: The on-call rotation is shared equally across the team, typically following a primary/secondary (follow-the-sun) model to ensure no single person bears a disproportionate burden.
Focus on Prevention: We invest heavily in automation, robust testing, and system design to prevent pages before they happen. The goal of on-call is not to heroically fight fires, but to manage rare, complex failures and use those learnings to make the system more resilient.
Actionable Alerts: We have a strict policy against alert fatigue. Alerts must be actionable and require immediate human intervention.

What We're Looking For (Must-Haves)

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5-8+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure).
Strong programming/scripting skills in languages like Python, Go, or Java.
Proven experience with containerization and orchestration technologies (Docker, Kubernetes).
Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD).
Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems.

What Will Make You Stand Out (Nice-to-Haves)

Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure.
Direct experience supporting ML/AI inferencing services in production.
Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, with a view to mapping them to RDUs.
Knowledge of model serving frameworks like vLLM, SGLang, or Ray.
Understanding of MLOps principles and practices.
Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached).
Strong Linux/Unix system administration fundamentals.

Why SambaNova?

Massive Impact: You will be a key part of a critical platform with high visibility and direct impact on our product and engineers.
Cutting-Edge Technology: Work with a world-class team on one of the most advanced AI stacks in the industry.
Autonomy and Growth: We trust you to make technical decisions. This is a greenfield opportunity to build something remarkable from the ground up.
Competitive Compensation: Including equity, excellent benefits, and a flexible work environment.

Submission Guidelines

Please note that in order to be considered an applicant for any position at SambaNova Systems, you must submit an application form for each position for which you believe you are qualified.

EEO Policy

SambaNova Systems is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to age (40 and over), color, disability, gender identity, genetic information, marital status, military or veteran status, national origin/ancestry, race, religion, creed, sex (including pregnancy, childbirth, breastfeeding), sexual orientation, or any other applicable status protected by federal, state, or local laws.

Benefits Summary for US-Based, Full-Time Employment Positions

SambaNova offers a competitive total rewards package, including base salary, equity, and benefits. We cover 95% of the premium for employee medical insurance and 77% of the premium for dependents, and we offer a Health Savings Account (HSA) with an employer contribution. We also offer Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans, in addition to Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care. Our library of well-being benefits available to you and your dependents includes a full subscription to Headspace, Gympass+ membership with access to physical gyms, One Medical membership, counseling services with an Employee Assistance Program, and much more.
