
Member of Technical Staff (AI Infrastructure Engineer)

Perplexity AI

AI Infra Engineer

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities
  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
Qualifications
  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
Required Skills
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments
Preferred Skills
  • Experience with Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
  • Familiarity with GPU cluster management and CUDA optimization
  • Experience with other ML frameworks like TensorFlow or distributed training libraries
  • Background in HPC environments, parallel computing, and high-performance networking
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi-stage builds for ML workloads
Required Experience
  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management
Vacancy posted 4 days ago
Location: San Francisco, CA
