GPU Cloud Platform Engineer

Yotta Labs

Location: Remote (Global)
Type: Full-time
Company: Yotta Labs
Apply: View email address on click.appcast.io

About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads (ranging from LLMs to generative models) deployed in Kubernetes-based GPU clusters.

If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities

  • Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
  • Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
  • Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
  • Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
  • Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
  • Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover.
  • Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
  • Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

Qualifications

  • Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
  • 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
  • Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
  • Proficiency in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
  • Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
  • Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
  • Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
  • Understanding of high-performance communication protocols such as IB, RoCE, NVLink, and PCIe.
  • Strong communication skills, self-motivation, and team collaboration.

Preferred Experience

  • Experience in developing and operating MaaS platforms or large-scale model inference clusters.
  • Proven track record of leading multi-cluster system development or performance optimization projects.
  • Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.
  • Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
  • Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
  • Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs?

  • Be part of a visionary team aiming to redefine AI infrastructure.
  • Work on cutting-edge technologies that bridge AI and decentralized computing.
  • Collaborate with experts from leading institutions and tech companies.
  • Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to the email address shown on click.appcast.io. Please include links to any relevant projects or contributions.

Vacancy posted 4 days ago