Member of Technical Staff (AI Infrastructure Engineer)

Perplexity AI

AI Infra Engineer

We are looking for an AI Infrastructure Engineer to join our growing team. Our stack includes Kubernetes, Slurm, Python, C++, and PyTorch, running primarily on AWS. In this role, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
Manage and optimize Slurm-based HPC environments for distributed training of large language models
Develop robust APIs and orchestration systems for both training pipelines and inference services
Implement resource scheduling and job management systems across heterogeneous compute environments
Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
Experience with deploying and managing distributed training systems at scale
Deep understanding of container orchestration and distributed systems architecture
High-level familiarity with LLM architecture and training processes (multi-head attention, multi-query and grouped-query attention, distributed training strategies)
Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

Expert-level Kubernetes administration and YAML configuration management
Proficiency with Slurm job scheduling, resource management, and cluster configuration
Python and C++ programming with a focus on systems and infrastructure automation
Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
Strong understanding of networking, storage, and compute resource management for ML workloads
Experience developing APIs and managing distributed systems for both batch and real-time workloads
Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

Experience with Kubernetes operators and custom controllers for ML workloads
Advanced Slurm administration, including multi-cluster federation and complex scheduling policies
Familiarity with GPU cluster management and CUDA optimization
Experience with other ML frameworks like TensorFlow or distributed training libraries
Background in HPC environments, parallel computing, and high-performance networking
Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

Demonstrated experience managing large-scale Kubernetes deployments in production environments
Proven track record with Slurm cluster administration and HPC workload management
Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure
Experience supporting both long-running training jobs and high-availability inference services
Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management

Vacancy posted 4 days ago
