Member of Technical Staff (AI Infrastructure Engineer)
Perplexity AI
AI Infra Engineer
We are looking for an AI Infrastructure Engineer to join our growing team. Our stack includes Kubernetes, Slurm, Python, C++, and PyTorch, and we run primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
Responsibilities
- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
- Manage and optimize Slurm-based HPC environments for distributed training of large language models
- Develop robust APIs and orchestration systems for both training pipelines and inference services
- Implement resource scheduling and job management systems across heterogeneous compute environments
- Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
- Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
- Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
- Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
Qualifications
- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
- Experience with deploying and managing distributed training systems at scale
- Deep understanding of container orchestration and distributed systems architecture
- High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)
- Experience managing GPU clusters and optimizing compute resource utilization
Required Skills
- Expert-level Kubernetes administration and YAML configuration management
- Proficiency with Slurm job scheduling, resource management, and cluster configuration
- Python and C++ programming with a focus on systems and infrastructure automation
- Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
- Strong understanding of networking, storage, and compute resource management for ML workloads
- Experience developing APIs and managing distributed systems for both batch and real-time workloads
- Solid debugging and monitoring skills with expertise in observability tools for containerized environments
Preferred Skills
- Experience with Kubernetes operators and custom controllers for ML workloads
- Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
- Familiarity with GPU cluster management and CUDA optimization
- Experience with other ML frameworks like TensorFlow or distributed training libraries
- Background in HPC environments, parallel computing, and high-performance networking
- Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
- Experience with container registries, image optimization, and multi-stage builds for ML workloads
Required Experience
- Demonstrated experience managing large-scale Kubernetes deployments in production environments
- Proven track record with Slurm cluster administration and HPC workload management
- Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure
- Experience supporting both long-running training jobs and high-availability inference services
- Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management
Vacancy posted 4 days ago

