Senior Cloud Platform Engineer

Softbank Investment Advisers

Join SambaNova Systems, a leader in frontier tech, and help shape the future of AI computing. We are disrupting the AI and high-performance computing space with our integrated hardware and software platform. Our DataScale systems and SambaFlow software are pushing the boundaries of what's possible with generative AI and large language models. We are a team of passionate innovators tackling some of the world's most challenging computational problems.

As a Senior Cloud Site Reliability Engineer (SRE) specializing in our AI Inferencing Service, you will be the guardian of its reliability, performance, and scalability. You will bridge the gap between software development and operations, applying an engineering mindset to solve operational challenges. Your primary focus will be ensuring our inference endpoints have exceptional uptime, low-latency response times, and efficient resource utilization, directly impacting the experience of our customers and the success of our AI products. This role includes participating in a shared on-call rotation to maintain 24/7 service reliability.

Service Ownership & On-Call: Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions. This includes deploying and supporting AI infrastructure in new regions, such as Asia, Europe, and Latin America, as our business grows. Participate in a balanced on-call rotation to provide 24/7 support for the service.

On-Call & Work-Life Balance: We believe a sustainable on-call schedule is critical for long-term success and team health. Our on-call philosophy is built on the following principles:

Balanced Rotation: The on-call rotation is shared equally across the team, typically following a primary/secondary (follow-the-sun) model to ensure no single person bears a disproportionate burden.
Focus on Prevention: We invest heavily in automation, robust testing, and system design to prevent pages before they happen. The goal of on-call is not to heroically fight fires, but to manage rare, complex failures and use those learnings to make the system more resilient.
Actionable Alerts: We have a strict policy against alert fatigue. Every alert must be actionable and must signal a condition that requires immediate human intervention.
Incident Management: Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence.
Monitoring & Alerting: Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization. A key responsibility is ensuring alerts are actionable and have a low false-positive rate, minimizing on-call fatigue.
Performance & Scalability: Proactively identify and eliminate performance bottlenecks. Design and implement auto-scaling policies to handle variable inference loads cost-effectively. Use insights from on-call incidents to drive improvements that enhance system stability and scalability.
Infrastructure as Code (IaC): Manage and evolve our infrastructure (AWS, GCP, and/or Azure, along with on-premises systems) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable.
CI/CD & Automation: Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates. A core goal is to automate manual toil identified during on-call shifts, reducing future operational overhead.
Capacity Planning: Forecast infrastructure needs based on product roadmaps and usage trends. Work with finance and engineering teams to manage cloud costs and optimize spending.
SLOs & SLIs: Define, measure, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments.

What We're Looking For (Must-Haves):

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5-8+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure).
Strong programming/scripting skills in languages like Python, Go, or Java.
Proven experience with containerization and orchestration technologies (Docker, Kubernetes).
Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD).
Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems.

What Will Make You Stand Out (Nice-to-Haves):

Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure.
Direct experience supporting ML/AI inferencing services in production.
Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, with a view to mapping those workloads to RDUs.
Knowledge of model serving frameworks like vLLM, SGLang, or Ray.
Understanding of MLOps principles and practices.
Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached).
Strong Linux/Unix system administration fundamentals.

Why SambaNova?

Massive Impact: You will be a key part of a critical platform with high visibility and direct impact on our product and engineers.
Cutting-Edge Technology: Work with a world-class team on one of the most advanced AI stacks in the industry.
Autonomy and Growth: We trust you to make technical decisions. This is a greenfield opportunity to build something remarkable from the ground up.
Competitive Compensation: Including equity, excellent benefits, and a flexible work environment.

Vacancy posted 9 hours ago

