Senior Site Reliability Engineer
3 weeks ago
Overview
We are a US-based outsourcing software development company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders. Over the past few years, we’ve been hiring specialists worldwide while our main development centers were in Ukraine. We continue to expand and grow centers in different parts of the world. Dev.Pro is open to hiring specialists from other countries as well as Ukrainians living outside of Ukraine. We stand with Ukraine and support our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About This Opportunity
We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you’ll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You’ll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.

What’s in it for you
Join a fast-scaling company shaping the future of AI infrastructure in Europe
Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads
Collaborate with a top-tier international team and grow through global AI and cloud events

Qualifications
5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
Strong Python or Go skills for automation and observability
Infrastructure-as-code experience (Terraform, Ansible, Helm)
Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
GPU resource management knowledge (MIG, NCCL, CUDA, containers)
Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
Linux systems engineering, CI/CD, and configuration management skills
Strategic thinking with strong technical and business communication
Organization, autonomy, adaptability
Advanced English level

Desirable
Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration

Key Responsibilities
Automate deployment, scaling, and lifecycle management of GPU clusters
Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
Collaborate with teams to optimize performance, resources, and fault recovery at petascale

Employment details
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Other
Industries: IT Services and IT Consulting
-
Senior Site Reliability Engineer
6 days ago
Municipio de Rincón de los Sauces, Argentina; Dev.Pro; Full-time
-
Site Reliability Engineer
4 weeks ago
Municipio de Rincón de los Sauces, Argentina; redbee; Full-time
Senior Site Reliability Engineer. We are looking for curious, passionate team players who want to grow, innovate, and learn the latest technologies. We do things differently and we are always on the move! We have an A.C.T.I.VA culture, and our purpose is to help you choose the best path so you can crush it. Requirements for this position...
-
Site Reliability Engineer
3 days ago
Municipio de Rincón de los Sauces, Argentina; Sur Global; Full-time
Overview: As the Site Reliability Engineer, you will support and scale the infrastructure powering their secure, mission-critical SaaS platform. You must be confident in operating and debugging both modern infrastructure (cloud-native, containerized services) and classic Windows production environments (IIS, SQL Server AlwaysOn, Service Broker), with the...
-
Senior Site Reliability Engineer, Observability
4 weeks ago
Rincón de los Sauces, Argentina; Chainlink Labs; Full-time
Chainlink Labs is the primary contributing developer of Chainlink, the decentralized computing platform powering the verifiable web. Chainlink is the industry-standard platform for providing access to real-world data, offchain computation, and secure cross-chain interoperability across any blockchain. Chainlink...
-
Senior SRE
6 days ago
Municipio de Rincón de los Sauces, Argentina; Dev.Pro; Full-time
A global software development company is seeking a Senior Site Reliability Engineer to enhance the stability and efficiency of its services. This remote role involves leading initiatives to improve monitoring and operational excellence while collaborating with engineering teams. Applicants should have over 5 years of experience in SRE or observability,...
-
Senior Site Reliability Engineer
2 weeks ago
Municipio de Rincón de los Sauces, Argentina; Canonical; Full-time
Role Overview: Next-gen operations at scale, with pure Python infra-as-code, from bare metal to containers and applications. Our goal is to perfect enterprise infrastructure DevOps. We run hundreds of private cloud, Kubernetes, and application clusters for customers across...
-
Site Reliability Engineer
2 weeks ago
Municipio de Esquel, Argentina; MyPetroCareer.com; Full-time
ExxonMobil Business Support Center Argentina S.R.L, an affiliate of Exxon Mobil Corporation (*). About Us: At ExxonMobil, our vision is to lead in energy innovations that advance modern living and a net-zero future. As one of the world's largest publicly traded energy and chemical...
-
Senior DevOps
4 weeks ago
Municipio de Esquel, Argentina; INGENIEROJOB; Full-time
Senior DevOps / Site Reliability Engineer (Azure) (Ref-Lch). We are looking for a highly skilled Senior DevOps / Site Reliability Engineer with deep experience in Azure cloud, CI/CD automation, and secure workload identity. This role is ideal for someone who has mastered modern DevOps practices, understands cloud architecture at scale, and can lead the design and...
-
Site Reliability Engineer Junior
4 weeks ago
Municipio de Rincón de los Sauces, Argentina; Whitestack; Full-time
We are looking for high-potential junior engineers interested in developing their careers and immersing themselves in new technologies oriented toward software development and cloud engineering. As a Site Reliability Engineer Junior, you will have the opportunity to collaborate on tasks that contribute to the stability, availability, and performance of cloud infrastructures...
-
Site Reliability Engineer
2 weeks ago
Municipio de Esquel, Argentina; MyPetroCareer.com; Full-time
A global energy company is seeking a Site Reliability Engineer to manage and automate operations in a dynamic environment. The role includes developing infrastructure as code and ensuring system reliability. Candidates must have a bachelor's degree in a relevant field and 2+ years of experience in site reliability engineering or infrastructure with...