HPC Operations Engineer
Lambda Inc.
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU.

If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco / Bellevue office location 4 days per week; Lambda's designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You'll Do
- Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
- Remotely install and configure operating systems, firmware, software, and networking on HPC clusters, both manually and using automation tools
- Troubleshoot and resolve HPC cluster issues, working closely with physical deployment teams on-site
- Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
- Contribute to the creation and maintenance of Standard Operating Procedures
- Provide regular and well-communicated updates to project leads throughout each deployment
- Mentor and assist less experienced team members
- Stay up-to-date on the latest HPC/AI technologies and best practices

You
- Are a deeply experienced HPC engineer comfortable with logical provisioning of a cluster
- Have a strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
- Have 5+ years of experience in deploying and configuring HPC clusters for AI workloads
- Have an innate attention to detail
- Are an expert in configuring and troubleshooting:
  - SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
  - Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments
  - Linux-based compute nodes, firmware updates, driver installation
  - SLURM, Kubernetes, or other job scheduling systems
- Work well under deadlines and structured project plans, and know when and how to ask for changes to project timelines
- Have excellent problem-solving and troubleshooting skills
- Have flexibility to travel to our North American data centers as on-site needs arise or as part of training exercises
- Are able to work independently and as part of a team
- Are comfortable mentoring and supporting junior HPC engineers on cluster deployments

Nice to Have
- Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
- Experience with containerization technologies (Docker, Kubernetes)
- Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
- Keen situational awareness in customer situations, employing diplomacy and tact
- Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

Benefits
- We offer generous cash & equity compensation
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use

A Final Note
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.



