HPC Operations Engineer
Lambda
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco / Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You’ll Do
- Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
- Remotely install and configure operating systems, firmware, software, and networking on HPC clusters, both manually and using automation tools
- Troubleshoot and resolve HPC cluster issues, working closely with physical deployment teams on-site
- Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
- Contribute to the creation and maintenance of Standard Operating Procedures
- Provide regular and well-communicated updates to project leads throughout each deployment
- Mentor and assist less experienced team members
- Stay up-to-date on the latest HPC/AI technologies and best practices

You
- Are a deeply experienced HPC engineer comfortable with logical provisioning of a cluster
- Have a strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
- Have 5+ years of experience in deploying and configuring HPC clusters for AI workloads
- Have an innate attention to detail
- Are an expert in configuring and troubleshooting:
  - SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
  - Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments
  - Linux-based compute nodes, firmware updates, driver installation
  - SLURM, Kubernetes, or other job scheduling systems
- Work well under deadlines and structured project plans, while also knowing when and how to ask for changes to project timelines
- Have excellent problem-solving and troubleshooting skills
- Have flexibility to travel to our North American data centers as on-site needs arise or as part of training exercises
- Are able to work independently and as part of a team
- Are comfortable mentoring and supporting junior HPC engineers on cluster deployments

Nice to Have
- Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
- Experience with containerization technologies (Docker, Kubernetes)
- Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
- Keen situational awareness in customer situations, employing diplomacy and tact
- Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
- Founded in 2012, with 500+ employees, and growing fast
- Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Our values are publicly available
- We offer generous cash & equity compensation
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
$126k - $180k

