GPU Cloud Platform Engineer
Yotta Labs
Location: Remote (Global)
Type: Full-time
Company: Yotta Labs
Apply: email address available on click.appcast.io

About Yotta Labs
Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview
We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities
- Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
- Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
- Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
- Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
- Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
- Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and failover.
- Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
- Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

Qualifications
- Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
- 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
- Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
- Proficiency in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
- Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
- Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
- Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
- Understanding of high-performance interconnects and communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.
- Strong communication skills, self-motivation, and the ability to collaborate within a team.

Preferred Experience
- Experience in developing and operating MaaS platforms or large-scale model inference clusters.
- Proven track record of leading multi-cluster system development or performance optimization projects.
- Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.
- Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
- Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
- Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to extend and optimize open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs?
- Be part of a visionary team aiming to redefine AI infrastructure.
- Work on cutting-edge technologies that bridge AI and decentralized computing.
- Collaborate with experts from leading institutions and tech companies.
- Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply
Interested candidates should apply directly or send their resume and a brief cover letter by email (address available on click.appcast.io). Please include links to any relevant projects or contributions.


