
AI Infra Engineer – SRE (Kubernetes)

Work from home Full-time role Hiring

Job Category: Software Engineering
Job Type: Full Time
Job Location: Hybrid Remote

About The Role

We are a fast-growing AI infrastructure company building cutting-edge GPU cloud platforms and high-performance inference solutions that empower AI developers, startups, and enterprises worldwide. As we scale our global operations, we are looking for a skilled and hands-on AI Infra Engineer – SRE (Kubernetes) to join our Global Infrastructure team.

Role Overview

This is a critical hands-on position focused on the reliability, performance, and operational excellence of large-scale, high-performance AI/ML GPU clusters in our data centers. As an AI Infra Engineer – SRE (Kubernetes), you will design, operate, and optimize Kubernetes-based infrastructure to ensure maximum uptime, efficiency, and scalability for demanding AI workloads. You will bring deep expertise in system-level troubleshooting, GPU cluster management, and automation to keep our platforms running at peak performance.

Key Responsibilities

  • Design, build, and maintain scalable, production-grade AI/ML infrastructure using Kubernetes.
  • Proactively monitor GPU cluster health, performance, and utilization across compute, accelerators, storage, and networking layers, performing root-cause analysis and resolution.
  • Develop and implement automation for infrastructure provisioning, configuration, and ongoing management.
  • Own the complete GPU node lifecycle — including provisioning, dynamic scaling, maintenance, decommissioning, and zero-downtime upgrades of GPU-enabled nodes in Kubernetes environments.
  • Build and improve CI/CD pipelines for reliable infrastructure deployment and orchestration.
  • Enforce security best practices, compliance standards, and operational excellence across the infrastructure stack.
  • Lead incident response and post-incident improvements for issues related to GPUs, CPUs, high-speed storage, and networks.
  • Manage end-to-end customer GPU resource provisioning — from request intake and configuration to onboarding, troubleshooting, and support — ensuring high levels of customer satisfaction.
  • Stay up to date with the latest GPU hardware, software, and orchestration technologies, integrating relevant advancements into our platforms.
  • Be available for occasional regional or international travel to data center locations as required.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field.
  • 3+ years of practical experience in data center operations, infrastructure engineering, or site reliability engineering.
  • Strong background in infrastructure automation using tools such as Terraform and Ansible.
  • Deep hands-on experience with Kubernetes in large-scale environments, including:
      • NVIDIA GPU Operator for GPU driver management, device plugins, container toolkit, and monitoring (DCGM).
      • NVIDIA Network Operator for high-performance networking, RDMA, and GPUDirect support.
      • CNI (Container Network Interface) and CSI (Container Storage Interface) plugins tailored for AI/ML workloads.
      • Integration with job schedulers such as Slurm in Kubernetes clusters.
  • Proficiency in Linux system administration and scripting (Python, Bash).
  • Experience with observability stacks including Prometheus, Grafana, and Loki.
  • Solid understanding of GPU architecture, NVIDIA CUDA, NCCL, and AI/ML frameworks is a strong plus.
  • Excellent troubleshooting skills with the ability to analyze complex system logs and performance metrics.
  • Strong communication and collaboration skills to work effectively with engineering and operations teams.

