GPU Cloud Platform Engineer
1 hour ago
Location: Remote (Global)
Type: Full-time
Company: Yotta Labs

About Yotta Labs
Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

Role Overview
We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring the high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

Responsibilities
Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and failover.
Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
Coordinate with IDC providers to plan and deploy large-scale GPU clusters, networks, and storage infrastructure that support internal cloud platforms and external customer needs.

Qualifications
Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or a related field; 3+ years of experience in system engineering or DevOps.
5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm; expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
Proficiency in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
Understanding of high-performance communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.
Strong communication skills, self-motivation, and team collaboration.

Preferred Experience
Experience in developing and operating MaaS platforms or large-scale model inference clusters.
Proven track record of leading multi-cluster system development or performance optimization projects.
Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the H100.
Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; ability to profile inference processes in multi-cluster setups and identify bottlenecks such as memory fragmentation and low compute efficiency.
Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks such as Triton, vLLM, and SGLang; ability to extend and optimize open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

Why Join Yotta Labs
Be part of a visionary team aiming to redefine AI infrastructure.
Work on cutting-edge technologies that bridge AI and decentralized computing.
Collaborate with experts from leading institutions and tech companies.
Enjoy a flexible, remote work environment that values innovation and autonomy.

How to Apply
Interested candidates should apply directly or send their resume and a brief cover letter. Please include links to any relevant projects or contributions.
-
Remote GPU Cloud Platform Engineer for AI Infra
1 hour ago
Argentina | Yotta Labs | Full-time
A pioneering AI technology firm is seeking a GPU Cloud Platform Engineer to join their team. This role involves building and operating large-scale GPU clusters, ensuring stability and performance in cloud environments. The ideal candidate will have extensive experience in cloud-native development, particularly with Kubernetes and containerization...
-
Cloud Engineer – Platform
1 hour ago
Argentina | BETSOL | Full-time
Overview: We are seeking a hands-on Cloud Engineer to design, build, automate, and operate cloud infrastructure and platform services. This role requires strong experience with Infrastructure as Code, container orchestration, and cloud-native databases, along with a mindset focused on reliability, automation, and operational excellence. You will work closely...
-
GCP AI Platform MLOps Engineer
1 hour ago
Argentina | DaCodes | Full-time
GCP AI Platform MLOps Engineer (DevOps + Machine Learning Operations). Work at DaCodes! We are a firm of experts in software and high-impact digital transformation. For 10 years we have created solutions focused on technology and innovation thanks to our team of 220+ talented #DaCoders,...
-
Cloud Platform Engineer: IaC, Kubernetes
1 hour ago
Argentina | BETSOL | Full-time
A dynamic tech firm in Argentina is seeking an experienced Cloud Engineer to design, build, automate, and operate cloud infrastructure and platform services. The ideal candidate should have over 4 years of hands-on experience in DevOps engineering and expertise in tools like Terraform and Kubernetes. Responsibilities include managing cloud infrastructure,...
-
Senior Java Engineer
2 weeks ago
Argentina | Hitachi Vantara Corporation | Full-time
A global technology company based in Argentina is seeking a Specialist Java Engineer to join their team and work on an AdTech platform. The role requires 4+ years of software development experience, strong Java skills, and proficiency in cloud services. The candidate will participate in agile practices, design new solutions, and enhance existing services....
-
Senior DevOps Engineer — Remote Cloud Platform
1 hour ago
Argentina | Nerdy | Full-time
A leading educational technology company is seeking a Senior DevOps Engineer to join their remote team in Argentina. The ideal candidate will have over 4 years of experience optimizing cloud infrastructure, especially with AWS and Kubernetes. This role involves managing security, CI/CD processes, and collaborating with development teams to enhance platform...
-
Remote DevSecOps Engineer
1 hour ago
Argentina | Jobgether | Full-time
A technology recruiting platform is seeking a DevSecOps Engineer to work remotely from Argentina. This role involves embedding security throughout CI/CD pipelines, navigating cloud environments, and enhancing platform security. Candidates should have over 5 years of relevant experience, strong cloud knowledge (AWS, GCP, Azure), and programming skills in Python or Go....
-
Argentina | Alpaca | Full-time
A leading tech company is looking for a Senior Data Platform Engineer in Argentina. This role focuses on designing and developing scalable data management solutions, handling more than 100 million events daily. Candidates should have extensive experience in data engineering, proficiency in Python and SQL, and familiarity with cloud technologies like Google...
-
Argentina | Devsu | Full-time
A leading tech company in Argentina seeks a GCP Backend / Cloud Engineer to design and deploy cloud-native solutions. The ideal candidate has over 5 years of experience with Google Cloud Platform and strong backend development skills in Python. This role emphasizes collaboration within diverse teams and offers a remote-friendly culture with opportunities for...
-
Platform Engineer
1 hour ago
Argentina | PadSplit | Full-time
Overview: PadSplit is looking for a senior Platform Engineer to strengthen our backend systems and ensure the stability, scalability, and performance of our core marketplace platform. This role is critical as we modernize our architecture, build new services, and improve reliability across our Django- and AWS-powered infrastructure. While primarily...