
GPU Cloud Platform Engineer

Yotta Labs

Location: Remote (Global)
Type: Full-time
Company: Yotta Labs
Apply: View email address on click.appcast.io

About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters.

If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities

  • Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
  • Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
  • Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
  • Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
  • Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
  • Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover.
  • Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
  • Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

Qualifications

  • Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
  • 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
  • Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
  • Proficiency in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
  • Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
  • Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
  • Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
  • Understanding of high-performance communication protocols such as IB, RoCE, NVLink, and PCIe.
  • Strong communication skills, self-motivation, and team collaboration.

Preferred Experience

  • Experience in developing and operating MaaS platforms or large-scale model inference clusters.
  • Proven track record of leading multi-cluster system development or performance optimization projects.
  • Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.
  • Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
  • Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
  • Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs?

  • Be part of a visionary team aiming to redefine AI infrastructure.
  • Work on cutting-edge technologies that bridge AI and decentralized computing.
  • Collaborate with experts from leading institutions and tech companies.
  • Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to the email address shown on click.appcast.io. Please include links to any relevant projects or contributions.

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the GPU Cloud Platform Engineer in New York, NY vacancy
  • A pioneering AI infrastructure company is seeking a GPU Cloud Platform Engineer to design and operate large-scale GPU clusters. This remote position aims to ensure high availability and performance of containerized AI workloads across cloud environments. The ideal candidate... 
    Cloud
    Remote job

    Yotta Labs

    New York, NY
    3 days ago
  • Alumni Ventures is hiring for a Platform Engineering role in New York City, focused on developing an ultrafast AI inference platform. This position...  ...challenges like low-level systems development, and efficient GPU workload management. Successful candidates will have 3-5... 
    Cloud
    Remote job

    Alumni Ventures

    New York, NY
    3 days ago
  •  ...in New York is seeking an experienced infrastructure engineer to build backend services and manage cloud infrastructure. The successful candidate will work...  ...Ideal candidates will have over 3 years' experience in platform engineering, proficiently using technologies like... 
    Cloud

    triomics inc.

    New York, NY
    3 days ago
  • Runpod is seeking a Technical Content Writer to create engaging, in-depth content for our GPU cloud platform that informs and attracts our AI-centric audience. You will partner with marketing, product, and development teams to maintain documentation standards and enhance... 
    Cloud
    Remote job

    Runpod

    New York, NY
    1 day ago
  • Group: Impossible Cloud / Impossible Cloud Network (ICN) Focus:...  ...Enterprise Storage & Decentralized GPU Orchestration Location: Zug,...  ...to build an AI-first platform encompassing storage, compute...  ...Execution: Partner closely with engineering to define specifications, manage... 
    Cloud
    Work experience placement
    Remote work

    Impossible Cloud

    New York, NY
    2 days ago
  • A cutting-edge AI company in New York is seeking a skilled engineer to work on cluster management and GPU infrastructure. You will be responsible for building tools for monitoring and observability while collaborating closely with training teams. Ideal candidates have... 
    Cloud

    Reflection

    New York, NY
    17 hours ago
  • $180k - $250k

     ...philanthropic efforts support experienced engineers who are tasked with building...  ...for evaluating the complex GPU and AI/ML needs of world-class...  ...on a wide range of compute platforms. This role reports to the...  ...specific GPU, AI/ML, and HPC and cloud requirements. Technical... 
    Cloud
    Local area

    Schmidt Entities

    New York, NY
    3 days ago
  • $110k - $140k

     ...is on a mission to make high‑performance cloud infrastructure easy to use, affordable,...  ..., scalable, global Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage solutions...  ...seeking a highly skilled and experienced AI Platform Engineer to own the strategy and execution for... 
    Cloud
    Work at office
    Immediate start
    Remote work
    Flexible hours

    Vultr

    New York, NY
    3 days ago
  • A leading tech company in the United States is seeking an experienced Infrastructure GPU Engineer to build and support high-performance cloud infrastructure. This role involves optimizing resource allocation for GPU workloads, ensuring system reliability, and collaborating... 
    Cloud
    Remote job

    DevOpsChat

    New York, NY
    3 days ago
  • $155k - $215k

     ...with them. As our first dedicated ML Platform Engineer, you'll define the technical direction and...  ...today and are investing in hosted GPU inference to support the next generation...  ...infrastructure expertise ~ Familiarity with cloud ML services (AWS SageMaker, GCP Vertex... 
    Cloud
    Full time
    Work at office
    Local area

    Charlie Health Engineering, Product & Design

    New York, NY
    1 day ago
  • $140k - $200k

     ...building a proprietary AI and data platform that powers our investment...  ...structured finance. We are engineers and investors working together...  ...and real-time), including GPU compute provisioning and container...  ...between services. • Manage cloud infrastructure (Azure) including... 
    Cloud
    Flexible hours

    Anthelion Capital Holdings

    New York, NY
    17 hours ago
  •  ..., primarily in architecture, engineering, and construction, extract structured...  ..., and project files. Our platform combines embedding models,...  ...agents execute in customer cloud environments. You’ll own the...  ...infrastructure inference services, GPU workloads, model serving,... 
    Cloud

    Nomic, Inc.

    New York, NY
    17 hours ago
  • $160k - $287k

     ...building not only new products but also new platforms that reliably create value for both...  ...We are seeking a Senior CVML Platform Engineer to help design, build, and evolve the...  ...optimize hybrid compute environments (cloud + on‑prem/GPU infrastructure) to support large‑scale... 
    Cloud
    Remote job
    Full time
    Temporary work
    Immediate start
    Visa sponsorship

    Blue River Technology

    New York, NY
    3 days ago
  •  ...radiology and AI diagnostics platform delivering 24/7 imaging insights...  ...-throughput medical imaging, GPU-backed inference, global distribution...  ...predictable, and easier for engineers to build on. Why This Role...  ...Infrastructure and Cloud Own and evolve Radimal’s AWS and... 
    Cloud
    Remote job
    Local area

    Radimal

    New York, NY
    3 days ago
  • $60 - $85 per hour

     ...Job Description We are sharing a specialised part-time consulting opportunity for professionals experienced in cloud architecture, platform engineering, site reliability, DevOps, DevSecOps, cloud security, FinOps, and structured cloud infrastructure workflows. This... 
    Cloud
    Remote job
    Hourly pay
    Weekly pay
    Job sharing
    Contract work
    Part time
    For contractors
    Flexible hours

    24-MAG

    New York, NY
    4 days ago
  •  ...Container-based technologies Experience in any of the following cloud service providers - GCP, Azure or AWS. Sound experience with...  ...infrastructure and cloud computing Skills Desired 6+ years of Overall Engineering experience 4+ years of experience working in AWS, Azure or GCE... 
    Cloud

    ALLTECH CONSULTING SVC INC

    New York, NY
    3 days ago
  • A green technology company is seeking a Software Engineer II to join their remote-first Data Platform team. You will contribute to building and maintaining a multi-cloud Databricks infrastructure, working on CI/CD pipelines and facilitating effective use of Databricks... 
    Cloud
    Remote job

    Uplight, Inc

    New York, NY
    3 days ago
  • Senior Platform Engineer Why this Role Matters: At Greenbox Capital, we help small businesses thrive by giving them fast, accessible funding...  ...or migration initiatives Experience working in cloud-native environments (preferably Azure) Experience designing... 
    Cloud
    Remote work
    Flexible hours

    Greenbox Capital

    New York, NY
    3 days ago
  • $75k - $90k

    A cloud infrastructure company based in the United States is seeking an experienced RMA Technical Specialist to enhance the RMA process...  ...skills and must adapt to new technologies. Responsibilities include GPU/CPU troubleshooting, vendor interaction, and maintaining... 
    Cloud
    Local area

    Vultr

    New York, NY
    3 days ago
  • A leading media agency is seeking an experienced Data Engineer to design and build scalable cloud-based data platforms. The successful candidate will be responsible for both the architecture and operational reliability of the data pipelines, utilizing skills in Databricks... 
    Cloud
    Flexible hours

    Publicis Groupe Holdings B.V

    New York, NY
    1 day ago
  • $100k - $300k

     ...Founding- and Staff-level Engineers We are looking for Founding- and Staff-level Engineers...  ...foundational pillars of Cogent's data platform and integration pipeline in order to...  ...Terraform, Docker, Databricks, etc in multiple clouds For California Based Applicants... 
    Cloud

    Cogent Security, Inc.

    New York, NY
    2 days ago
  •  ...the storage strategy for k0rdent AI. In this role, you will define the roadmap and feature priorities for GPU cloud storage while collaborating with engineering and marketing teams. The ideal candidate should have over 5 years of experience in product management focused... 
    Cloud
    Remote job
    Full time

    Mirantis, Inc.

    New York, NY
    1 day ago
  • $190k - $230k

     ...Description Datavant is the data collaboration platform trusted for healthcare. Guided by our...  ...We're Looking For: As a Staff Data Engineer at Datavant, you will lead the design and...  ...use of data across a multi-tenant, multi-cloud environment. This is a hands-on technical... 
    Cloud

    Datavant

    New York, NY
    22 days ago
  •  ...of Openings: 1 Location: Remote Software Engineer III - 6-10 Years Experience Required Software...  ...and applications into the SCM-ERP platform. You will be responsible for designing and...  ...implement scalable, secure, and efficient cloud-based infrastructure for SCM-ERP applications... 
    Cloud
    Remote work

    Changeis

    New York, NY
    4 days ago
  • $150k - $300k

     ...Role:  Platform Engineer / DevOps Engineer – Trading Client: Elite FinTech Compensation: $150,000 - $300,000 + Bonus Location: New...  ...solutions for scalable deployment across private and public cloud infrastructure. Low Latency: Supporting and optimising a low... 
    Cloud
    Immediate start

    Hunter Bond Ltd

    New York, NY
    1 day ago
  •  ...technology firm located in New York is looking for a skilled DevOps/Platform Engineer to design and build a next-generation deployment platform....  ...Go or Rust, and in-depth knowledge of Kubernetes and multi-cloud environments like AWS or Azure. The role involves... 
    Cloud

    Sweya Information Technologies LLP

    New York, NY
    3 days ago
  •  ...modernization across accounting firms. This role involves operating multi-tenant deployments and integrating new firms onto the Modus platform. The ideal candidate will possess solid experience with AWS, Terraform, and scripting. Modus offers competitive compensation,... 
    Cloud

    Modus

    New York, NY
    3 days ago
  • $100k - $120k

    Senior Platform Operations Engineer We are looking for an experienced Senior Platform Operations Engineer to build, operate, and evolve our Azure...  ...improving developer experience, platform reliability, and cloud adoption. Responsibilities Design, implement, and operate... 
    Cloud
    Local area

    Press Ganey Associates LLC

    New York, NY
    2 days ago
  •  ...The engineering team at Chainalysis is inspired by solving the hardest technical challenges...  ...day and our job is to build a flexible platform that will allow us to adapt to those rapid...  ...Data Platform Engineer to join our Data Cloud team. This group accelerates innovation... 
    Cloud
    Flexible hours

    Chainalysis Inc.

    New York, NY
    17 hours ago
  • The Data Platform Engineer will be responsible for designing, implementing, and deploying scalable data solutions across diverse technical environments...  ...working across streaming, data lakes, analytics, and cloud platforms while ensuring strong performance, security,... 
    Cloud

    Compunnel, Inc.

    New York, NY
    17 hours ago
