[Remote] Senior Site Reliability Engineer, Core AI Infrastructure
Note: This is a remote position open to candidates in the USA. Coinbase is a leading company focused on increasing economic freedom through innovative financial solutions. They are seeking a Senior Site Reliability Engineer to join their IT Operations team, responsible for ensuring the reliability and automation of critical AI infrastructure while driving AI transformation within the company.
Responsibilities
- Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros
- Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments
- Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines
- Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence
- Develop full-stack applications that power internal AI products and infrastructure with Go or Python
Skills
- 5+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt)
- Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments
- Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines
- Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements
- Responsible use of generative AI, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality
- Expertise with Linux, Bash, Ruby, Python, and/or Go
- Expertise automating EC2 or container deployments with Terraform
- Strong network security fundamentals
- Experience managing and leveraging log aggregation
- Experience working in a highly regulated environment
- Experience in a fast-paced, high-growth company
- Experience in a remote-first IT environment
Benefits
- Total compensation may also include equity and bonus eligibility, and benefits (medical, dental, vision, 401(k))
Company Overview