AI Platform Support Engineer (US)

$115k - $140k

Lightning AI

Platform Support Engineer (US)

San Francisco, California, United States; Seattle, Washington, United States

Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What We're Looking For

Lightning AI is looking to hire a Platform Support Engineer to join our US Customer Experience team, supporting ML engineers running large-scale training and inference workloads across cloud infrastructure, Kubernetes, and GPU platforms in production environments.

This role sits at the intersection of ML systems, cloud infrastructure, Kubernetes, and customers. You'll support engineers training models, deploying inference systems, and scaling GPU workloads in production. You are not a ticket router or traditional support engineer. You are a technical partner to ML teams, helping diagnose failures, improve reliability, and guide customers through complex distributed systems problems.

The problems range from Kubernetes scheduling and GPU orchestration to distributed PyTorch failures, inference latency, networking bottlenecks, storage performance, and platform reliability. You'll gain exposure to a wide variety of real-world AI workloads across industries and help shape the infrastructure powering the next generation of ML applications.

What You'll Do

Work Directly With ML Engineers

Partner directly with customer engineering teams running training and inference workloads in production
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
Act as a technical advisor during high-impact incidents and platform degradation events
Translate infrastructure-level issues into actionable guidance for ML engineers
Build credibility with customers through strong technical reasoning and clear communication

Debug ML Infrastructure & Distributed Workloads

Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
Troubleshoot PyTorch, CUDA, NCCL, and inference-serving issues
Analyze logs, metrics, traces, and system behavior to isolate root causes
Debug containerized workloads running across Kubernetes and bare-metal GPU environments
Support customers scaling workloads across multi-node GPU systems
Diagnose performance bottlenecks involving compute, memory, networking, or storage

Improve Reliability & Platform Operations

Identify recurring patterns across customer issues and drive long-term reliability improvements
Contribute to post-incident reviews and operational improvements
Build internal tooling, automation, documentation, and runbooks
Partner closely with infrastructure, networking, and platform engineering teams
Help improve observability, operational visibility, and troubleshooting workflows
Improve the customer experience through better processes and technical guidance

What This Role Is Not

To set clear expectations:

This is not a traditional help desk or ticket routing support role
This is not purely customer success or account management
This is not a backend engineering role
This is not a passive escalation position

This role is for engineers who enjoy solving difficult technical problems while working closely with other engineers.

What You'll Need

Required Qualifications

Infrastructure & Systems

Strong software engineering and systems troubleshooting background
Experience with Kubernetes and containerized environments
Linux systems knowledge, including networking, storage, process management, and performance tuning
Experience with cloud infrastructure and distributed systems
Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry

ML Infrastructure Experience

Hands-on experience operating machine learning workloads in production or research environments
Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
Familiarity with GPU infrastructure and orchestration
Experience troubleshooting performance, reliability, or scaling issues in ML infrastructure
Understanding of the operational challenges involved in running ML systems at scale

Collaboration

Strong communication skills and ability to work directly with highly technical customers and engineering teams
Comfortable operating in fast-moving, highly ambiguous environments
Enjoys solving complex technical problems collaboratively

Ideal Experience

Experience with large-scale model training or distributed inference systems
Familiarity with Ray, Kubeflow, Slurm, or similar distributed scheduling platforms
Experience with InfiniBand, RDMA, or high-performance networking
Experience operating bare-metal infrastructure
Familiarity with storage systems commonly used in ML environments
Experience working at an AI infrastructure, cloud, MLOps, or developer tooling company
Contributions to platform engineering, developer infrastructure, or operational tooling projects
Experience writing automation, tooling, or scripts in Python or similar languages

This role is hybrid out of our Seattle or San Francisco offices, with an in-office requirement of at least 2 days per week and occasional team and company offsites. The role follows a Monday–Friday schedule, with working hours from 8:00 AM to 5:00 PM PT. We are not able to provide visa sponsorship for this role at this time.

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.

The anticipated annual base salary range for this role is:

$115,000 - $140,000 USD

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
Generous paid time off, plus holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.


Vacancy posted 5 hours ago
