Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Work from home Full-time role Hiring

Job Description:

  • Architect and maintain our core computing platform, running Kubernetes on AWS and on-premises, to provide a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premises bare-metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the lifecycle of single-tenant, managed deployments.

Requirements:

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits:

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions
