Senior Site Reliability Engineer
3 weeks ago
We are a US-based outsource software development company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders. Over the past few years, we’ve been hiring specialists all over the world while our main development centers were in Ukraine. Now we keep expanding and are growing our centers in different parts of the world. Dev.Pro is open to hiring specialists from other countries as well as Ukrainians who now live outside Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About this opportunity
We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you’ll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You’ll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.

What's in it for you:
- Join a fast-scaling company shaping the future of AI infrastructure in Europe
- Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads
- Collaborate with a top-tier international team and grow through global AI and cloud events

Is that you?
- 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
- Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
- Strong Python or Go skills for automation and observability
- Infrastructure-as-code experience (Terraform, Ansible, Helm)
- Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
- GPU resource management knowledge (MIG, NCCL, CUDA, containers)
- Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
- Linux systems engineering, CI/CD, and configuration management skills
- Strategic thinking with strong technical and business communication
- Organization, autonomy, adaptability
- Advanced English level

Desirable:
- Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration

Key responsibilities and your contribution
In this role, you’ll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.
- Automate deployment, scaling, and lifecycle management of GPU clusters
- Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
- Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
- Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
- Collaborate with teams to optimize performance, resources, and fault recovery at petascale
-
Lead Site Reliability Engineer
1 week ago
Buenos Aires, Argentina
Ecolab
Full-time
JOB DESCRIPTION
Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability. The Infrastructure Engineering team is responsible for the design and engineering of solutions and technologies working with other engineering teams to support the...
-
Site Reliability Engineer Observability Lead
2 weeks ago
Buenos Aires, Argentina
Unilever
Full-time
Site Reliability Engineer Observability Lead
Responsibilities
- Create a robust observability framework, including APM, alarming, dashboarding, and event correlation, integrated with an existing observability platform.
- Perform analytics on previous incidents and usage patterns to better predict issues and take proactive actions.
- Troubleshoot priority incidents,...
-
Site Reliability Engineer
6 days ago
Capital Federal, Buenos Aires, Argentina
Rp consultoria
Full-time
We are looking for a **Ssr. Site Reliability Engineer** to join our team in Buenos Aires, Argentina. What are we looking for in a **Ssr. Site Reliability Engineer**? An active contributor to automating tasks that require manual intervention in the software development cycle. Eager to learn,...
-
Senior Site Reliability Engineer
2 days ago
Buenos Aires, Argentina
Neara
Full-time
Neara is a high-growth, venture-backed Series B tech company headquartered in Sydney, Australia. We work with 75% of the utilities in Australia and New Zealand and are growing rapidly across the US and Europe. Our mission is to revolutionise the utilities industry by helping them future-proof their infrastructure and navigate the challenges of the clean...
-
Site Reliability Engineer
6 days ago
Buenos Aires, Argentina
Launchpad Technologies
Full-time
Launchpad, a people-first technology company, is a leader in North America's rapidly growing tech sector. Through two solutions, Launchpad supports its clients with digital transformation:
- Paasport™, our iPaaS solution, streamlines software integration and automates workflows.
- Nearshore Staff Augmentation, our managed IT staffing service, connects top...
-
Senior AI Site Reliability Engineer
1 week ago
Buenos Aires, Argentina
SQUIRE
Full-time
**WHO WE ARE**
SQUIRE is the leading business management system designed for the needs of barbers, shop owners, and their communities. We believe the pursuit of artistry and autonomy should not be restricted by the complexities of running a business. With SQUIRE, we provide custom-branded tools, resources, and guidance to help barbers of all stages and...
-
Site Reliability Engineer
2 weeks ago
Buenos Aires, Argentina
Exxon Mobil
Full-time
A global energy company is seeking a Site Reliability Engineer to manage and automate operations in Buenos Aires. Ideal candidates should have a Bachelor's degree and over 2 years of experience in site reliability or infrastructure engineering, specifically within a DevOps framework. The role involves developing infrastructure as code, performance...
-
Senior Site Reliability Engineer
22 hours ago
Capital Federal, Buenos Aires, Argentina
Business Commercial Management
Full-time
BCM Uruguay is Hiring!
Senior Site Reliability Engineer - Remote
Remote - LATAM
**English Level**: B2+ / C1 - Advanced
Contractor - USD
⏱ Full-Time
For a multinational digital engineering services company specializing in cutting-edge software and digital product development. When an idea appears, motivation and desire are born...
-
Site Reliability Engineer
2 days ago
Buenos Aires, Buenos Aires C.F., Argentina
Blockscout Limited
Full-time
Blockscout is a leading provider of indexing and UI services for EVM chains. Our team hosts explorers for many of the largest chains in the industry. Reliability is vital to our company's success. We are looking for a Site Reliability Engineer to strengthen our DevOps and Support teams.
Key responsibilities
- Monitor systems: Proactively watch production systems...
-
SRE - Site Reliability Engineer - Remote - 1526
22 hours ago
Buenos Aires, Argentina
Web:
Full-time
Job description:
What does the company do?
**A digital engineering company that since 2009 has been dedicated to improving digital product teams and facilitating digital transformation initiatives.** With more than 1,000 employees in five countries (Mexico, Colombia, Bolivia, Argentina, Northern Ireland), its innovative "workflows" approach ensures...