Member of Technical Staff (Cluster Management)
Fireworks AI
Requirements
This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success.
- Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
- Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
- Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
- Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
- Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
- Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
- In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
- Proven ability to troubleshoot complex issues across the entire stack
- Excellent communication, collaboration, and problem-solving skills
- Willingness to participate in on-call rotations
- (Desirable) Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and repairing GPUs and peripherals such as HBM and RDMA-enabled networking
- (Desirable) Experience with machine learning infrastructure, model serving, or distributed AI frameworks
- (Desirable) Hands-on experience in security and data protection
Responsibilities
- You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence
- You will partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and ever-evolving application paradigms
- Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
- Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
- Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
- Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
- Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
- Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
- On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts
Vacancy posted 1 day ago

