Senior Cloud Platform Engineer
Softbank Investment Advisers
Join SambaNova Systems, a leader in frontier tech, and help shape the future of AI computing. We are disrupting the AI and high-performance computing space with our integrated hardware and software platform. Our DataScale systems and SambaFlow software are pushing the boundaries of what's possible with generative AI and large language models. We are a team of passionate innovators tackling some of the world's most challenging computational problems.
As a Senior Cloud Site Reliability Engineer (SRE) specializing in our AI Inferencing Service, you will be the guardian of its reliability, performance, and scalability. You will bridge the gap between software development and operations, applying an engineering mindset to solve operational challenges. Your primary focus will be ensuring our inference endpoints have exceptional uptime, low-latency response times, and efficient resource utilization, directly impacting the experience of our customers and the success of our AI products. This role includes participating in a shared on-call rotation to maintain 24/7 service reliability.
Service Ownership & On-Call: Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions. This includes implementing and supporting AI infrastructure in new regions, such as Asia, Europe, and Latin America, to support the growth of our business. Participate in a balanced on-call rotation to provide 24/7 support for the service.
On-Call & Work-Life Balance: We believe a sustainable on-call schedule is critical for long-term success and team health. Our on-call philosophy is built on the following principles:
- Balanced Rotation: The on-call rotation is shared equally across the team, typically following a primary/secondary (follow-the-sun) model to ensure no single person bears a disproportionate burden.
- Focus on Prevention: We invest heavily in automation, robust testing, and system design to prevent pages before they happen. The goal of on-call is not to heroically fight fires, but to manage rare, complex failures and use those learnings to make the system more resilient.
- Actionable Alerts: We have a strict policy against alert fatigue. Alerts must be actionable and require immediate human intervention.
- Incident Management: Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence.
- Monitoring & Alerting: Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization. A key responsibility is ensuring alerts are actionable and have a low false-positive rate, minimizing on-call fatigue.
- Performance & Scalability: Proactively identify and eliminate performance bottlenecks. Design and implement auto-scaling policies to handle variable inference loads cost-effectively. Use insights from on-call incidents to drive improvements that enhance system stability and scalability.
- Infrastructure as Code (IaC): Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure, along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable.
- CI/CD & Automation: Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates. A core goal is to automate manual toil identified during on-call shifts, reducing future operational overhead.
- Capacity Planning: Forecast infrastructure needs based on product roadmaps and usage trends. Work with finance and engineering teams to manage cloud costs and optimize spending.
- SLOs & SLIs: Define, measure, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments.
What We're Looking For (Must-Haves):
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 5-8+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure).
- Strong programming/scripting skills in languages like Python, Go, or Java.
- Proven experience with containerization and orchestration technologies (Docker, Kubernetes).
- Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
- Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
- Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD).
- Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems.
What Will Make You Stand Out (Nice-to-Haves):
- Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure.
- Direct experience supporting ML/AI inferencing services in production.
- Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, with a view to mapping those workloads to RDUs.
- Knowledge of model serving frameworks like vLLM, SGLang or Ray.
- Understanding of MLOps principles and practices.
- Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached).
- Strong Linux/Unix system administration fundamentals.
Why SambaNova?
- Massive Impact: You will be a key part of a critical platform with high visibility and direct impact on our product and engineers.
- Cutting-Edge Technology: Work with a world-class team on one of the most advanced AI stacks in the industry.
- Autonomy and Growth: We trust you to make technical decisions. This is a greenfield opportunity to build something remarkable from the ground up.
- Competitive Compensation: Including equity, excellent benefits, and a flexible work environment.

