GPU Cloud Platform Engineer

Yotta Labs

Location: Remote (Global)
Type: Full-time
Company: Yotta Labs
Apply: View email address on click.appcast.io

About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters.

If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities

- Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
- Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
- Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
- Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
- Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
- Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover.
- Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
- Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

Qualifications

- Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
- 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
- Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
- Proficiency in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
- Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
- Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
- Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
- Understanding of high-performance communication protocols such as IB, RoCE, NVLink, and PCIe.
- Strong communication skills, self-motivation, and team collaboration.

Preferred Experience

- Experience in developing and operating MaaS platforms or large-scale model inference clusters.
- Proven track record of leading multi-cluster system development or performance optimization projects.
- Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.
- Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
- Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
- Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs?

- Be part of a visionary team aiming to redefine AI infrastructure.
- Work on cutting-edge technologies that bridge AI and decentralized computing.
- Collaborate with experts from leading institutions and tech companies.
- Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter by email (view email address on click.appcast.io). Please include links to any relevant projects or contributions.

Vacancy posted 4 days ago

