Senior Cloud Platform Engineer

Softbank Investment Advisers

Join SambaNova Systems, a leader in frontier tech, and help shape the future of AI computing. We are disrupting the AI and high-performance computing space with our integrated hardware and software platform. Our DataScale systems and SambaFlow software are pushing the boundaries of what's possible with generative AI and large language models. We are a team of passionate innovators tackling some of the world's most challenging computational problems.

As a Senior Cloud Site Reliability Engineer (SRE) specializing in our AI Inferencing Service, you will be the guardian of its reliability, performance, and scalability. You will bridge the gap between software development and operations, applying an engineering mindset to solve operational challenges. Your primary focus will be ensuring our inference endpoints have exceptional uptime, low-latency response times, and efficient resource utilization, directly impacting the experience of our customers and the success of our AI products. This role includes participating in a shared on-call rotation to maintain 24/7 service reliability.

Service Ownership & On-Call: Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions. This includes implementing and supporting AI infrastructure in new regions, such as Asia, Europe, and Latin America, to support the growth of our business. Participate in a balanced on-call rotation to provide 24/7 support for the service.

On-Call & Work-Life Balance: We believe a sustainable on-call schedule is critical for long-term success and team health. Our on-call philosophy is built on the following principles:

  • Balanced Rotation: The on-call rotation is shared equally across the team, typically following a primary/secondary (follow-the-sun) model to ensure no single person bears a disproportionate burden.
  • Focus on Prevention: We invest heavily in automation, robust testing, and system design to prevent pages before they happen. The goal of on-call is not to heroically fight fires, but to manage rare, complex failures and use those learnings to make the system more resilient.
  • Actionable Alerts: We have a strict policy against alert fatigue. Alerts must be actionable and require immediate human intervention.

  • Incident Management: Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence.
  • Monitoring & Alerting: Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization. A key responsibility is ensuring alerts are actionable and have a low false-positive rate, minimizing on-call fatigue.
  • Performance & Scalability: Proactively identify and eliminate performance bottlenecks. Design and implement auto-scaling policies to handle variable inference loads cost-effectively. Use insights from on-call incidents to drive improvements that enhance system stability and scalability.
  • Infrastructure as Code (IaC): Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable.
  • CI/CD & Automation: Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates. A core goal is to automate manual toil identified during on-call shifts, reducing future operational overhead.
  • Capacity Planning: Forecast infrastructure needs based on product roadmaps and usage trends. Work with finance and engineering teams to manage cloud costs and optimize spending.
  • SLOs & SLIs: Define, measure, and report on Service Level Objectives (SLOs) and Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments.

What We're Looking For (Must-Haves):

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
  • 5-8+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure).
  • Strong programming/scripting skills in languages like Python, Go, or Java.
  • Proven experience with containerization and orchestration technologies (Docker, Kubernetes).
  • Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
  • Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
  • Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD).
  • Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems.

What Will Make You Stand Out (Nice-to-Haves):

  • Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure.
  • Direct experience supporting ML/AI inferencing services in production.
  • Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, for the purpose of mapping those workloads to RDUs.
  • Knowledge of model serving frameworks like vLLM, SGLang or Ray.
  • Understanding of MLOps principles and practices.
  • Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached).
  • Strong Linux/Unix system administration fundamentals.

Why SambaNova?

  • Massive Impact: You will be a key part of a critical platform with high visibility and direct impact on our product and engineers.
  • Cutting-Edge Technology: Work with a world-class team on one of the most advanced AI stacks in the industry.
  • Autonomy and Growth: We trust you to make technical decisions. This is a greenfield opportunity to build something remarkable from the ground up.
  • Competitive Compensation: Including equity, excellent benefits, and a flexible work environment.
Vacancy posted 1 hour ago