GPU Cloud Platform Engineer
Yotta Labs
Location: Remote (Global)
Type: Full-time
Company: Yotta Labs

About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters.

If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities

- Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
- Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
- Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
- Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
- Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
- Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover.
- Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
- Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

Qualifications

- Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
- 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
- Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
- Proficiency in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
- Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
- Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
- Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
- Understanding of high-performance interconnects and communication protocols such as IB, RoCE, NVLink, and PCIe.
- Strong communication skills, self-motivation, and team collaboration.

Preferred Experience

- Experience in developing and operating MaaS platforms or large-scale model inference clusters.
- Proven track record of leading multi-cluster system development or performance optimization projects.
- Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.
- Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
- Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
- Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to extend and optimize open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs?

- Be part of a visionary team aiming to redefine AI infrastructure.
- Work on cutting-edge technologies that bridge AI and decentralized computing.
- Collaborate with experts from leading institutions and tech companies.
- Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter (view email address on click.appcast.io). Please include links to any relevant projects or contributions.


