
GPU Cloud Platform Engineer

Yotta Labs

Location: Remote (Global)

Type: Full-time

Company: Yotta Labs

Apply: View email address on click.appcast.io

About Yotta Labs
Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware—from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview
We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads (ranging from LLMs to generative models) deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities

Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.

Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.

Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.

Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems. Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.

Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.

Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

✅ Qualifications

Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.

5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands‑on experience in Kubernetes multi-cluster management and orchestration.

Familiarity with the Kubernetes ecosystem; hands‑on experience with tools such as kubectl and Helm, and expertise in multi‑cluster deployment, upgrade, scaling, and disaster recovery.

Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.

Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.

Hands‑on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.

Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.

Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.

Understanding of high-performance interconnects and communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.

Strong communication skills, self‑motivation, and team collaboration.

Preferred Experience

Experience in developing and operating MaaS platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.

Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.

Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.

Hands‑on experience with optimization techniques such as model quantization, static compilation, and multi‑GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.

Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs?

Be part of a visionary team aiming to redefine AI infrastructure.

Work on cutting-edge technologies that bridge AI and decentralized computing.

Collaborate with experts from leading institutions and tech companies.

Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply
Interested candidates should apply directly or send their resume and a brief cover letter by email (View email address on click.appcast.io). Please include links to any relevant projects or contributions.

Vacancy posted 10 hours ago