
HPC Operations Engineer

Full-time

Lambda

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Note: This position requires presence in our San Francisco / Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

What You’ll Do
  • Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
  • Remotely install and configure operating systems, firmware, software, and networking on HPC clusters, both manually and using automation tools
  • Troubleshoot and resolve HPC cluster issues, working closely with physical deployment teams on-site
  • Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
  • Contribute to the creation and maintenance of Standard Operating Procedures
  • Provide regular and well-communicated updates to project leads throughout each deployment
  • Mentor and assist less experienced team members
  • Stay up-to-date on the latest HPC/AI technologies and best practices

You
  • Are a deeply experienced HPC engineer comfortable with logical provisioning of a cluster
  • Have a strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
  • Have 5+ years of experience in deploying and configuring HPC clusters for AI workloads
  • Have an innate attention to detail
  • Are an expert in configuring and troubleshooting:
    • SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
    • Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments
    • Linux-based compute nodes, firmware updates, driver installation
    • SLURM, Kubernetes, or other job scheduling systems
  • Work well under deadlines and structured project plans, while also knowing when and how to ask for changes to project timelines
  • Have excellent problem-solving and troubleshooting skills
  • Have flexibility to travel to our North American data centers as on-site needs arise or as part of training exercises
  • Are able to work independently and as part of a team
  • Are comfortable mentoring and supporting junior HPC engineers on cluster deployments

Nice to Have
  • Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
  • Experience with containerization technologies (Docker, Kubernetes)
  • Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
  • Keen situational awareness in customer situations, employing diplomacy and tact
  • Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
  • Founded in 2012, with 500+ employees, and growing fast
  • Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
  • We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
  • Our values are publicly available:
  • We offer generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan that we all actually use

A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

Vacancy posted 4 days ago