
Member of Technical Staff (Cluster Management)

Fireworks AI

Requirements

This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success.

  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
  • Willingness to participate in on-call rotations
  • (Desirable) Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and repairing GPUs and peripherals such as HBM and RDMA-enabled networking
  • (Desirable) Experience with machine learning infrastructure, model serving, or distributed AI frameworks
  • (Desirable) Hands-on experience in security and data protection

What the job involves

As a Member of Technical Staff, Cluster Management at Fireworks AI, you will play a critical role in making our world-scale virtual AI cloud reliable, performant, and efficient.

  • You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence
  • You will partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and ever-evolving application paradigms
  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts

Vacancy posted 1 day ago
