[Remote] Executive Director, AI Infrastructure & Platform Engineering

Work from home Full-time role Hiring

Note: This is a remote position open to candidates in the USA. CVS Health is dedicated to shaping a more connected and compassionate health experience. The company is seeking an Executive Director for AI Infrastructure & Platform Engineering to lead the development and operational excellence of its AI compute platform, ensuring high availability and reliability for frontier AI workloads.

Responsibilities

  • Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success
  • Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps
  • Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement
  • Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives
  • Own the physical layer of the AI compute environment — GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability
  • Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement
  • Govern high-performance network fabric operations — RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation
  • Establish and enforce operational baselines across every layer of the stack — hardware, fabric, platform, and workload — with deviations detected, escalated, and resolved within defined SLAs
  • Direct Innovation POD strategy to develop self-healing and autonomous capabilities that proactively prevent service degradation before it impacts availability
  • Build and sustain a high-performing 24/7 operations model — designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention
  • Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
  • Oversee change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
  • Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
  • Lead GPU FinOps governance — utilization optimization, tenant quota enforcement, and cost reduction — in partnership with the Finance organization
  • Empower the Security SRE Lead to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance to frameworks including HIPAA and NIST AI RMF
  • Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment
  • Lead the operational transition from program-launch staffing to permanent CVS-owned operations — governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations
  • Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close
  • Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack
  • Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance

Skills

  • 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale — including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution)
  • Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires
  • Fluency with high-speed cluster fabrics — RoCE v2, InfiniBand, EVPN-VXLAN, or carrier-grade equivalent — and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management)
  • 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes
  • Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments
  • Hardware lifecycle, vendor accountability, and facility coordination experience — including capacity planning, RMA management, and multi-vendor escalation
  • Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables
  • Executive-level stakeholder communication, vendor negotiation, and budget ownership
  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field
  • Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage
  • Direct experience operating GPU clusters of 32 or more GPUs in production environments — including HPC, AI training, research computing, or comparable workloads
  • NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience
  • Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR)
  • Chaos engineering and AI-driven operations experience — predictive alerting and automated remediation patterns
  • Background in innovation programs, POD structures, or centers of excellence

Benefits

  • This position is eligible for a CVS Health bonus, commission, or short-term incentive program in addition to base pay.
  • This position also includes an award target in the company’s equity award program.
  • This full‑time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families.
  • The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.

Company Overview

  • CVS Health is a health solutions company that provides integrated healthcare services to its members. It was founded in 1963 and is headquartered in Woonsocket, Rhode Island, USA, with a workforce of more than 10,000 employees. Its website is https://www.cvshealth.com/.