Senior Site Reliability Engineer (CloudVision as a Service)
Requirements
- BS/MS degree in Computer Science or a related subject, or relevant experience,
- 5+ years software engineering experience,
- Experience developing or managing deployments of distributed database systems or scale out applications for a SaaS environment,
- Proficiency in Python, Golang, and/or other languages. Expected to be comfortable in Bash and/or other scripting languages
What the job involves
- We’re looking for Site Reliability Engineers to join Arista’s growing CloudVision-as-a-Service (CVaaS) global SRE team,
- SREs at Arista combine a strong software engineering background and systems architecture knowledge with a passion for operating production systems at scale,
- We are responsible for our global CloudVision service fleet, ensuring scalability, reliability, and stability,
- You’ll gain firsthand experience as part of a rapidly growing product, working with a passionate group of engineers who unapologetically put product reliability and customer experience first,
- We deeply believe in building highly automated and self-sustaining environments, prioritizing safe and efficient operations that leverage cutting-edge technologies and tools,
- Arista’s CloudVision is an enterprise network management and streaming telemetry SaaS offering,
- The CloudVision stack is entirely Kubernetes-native,
- Familiarity with GCP (Google Cloud Platform) and GKE (Google Kubernetes Engine) is preferred,
- Our technical stack includes, but is not limited to: Golang, Python, Ansible/Pulumi, Bash,
- You will be expected to develop, operate, and work with many different types of databases, either directly on Kubernetes or through managed DB products,
- We integrate with many different Open Source Software (OSS) projects that power our microservices stack, monitoring infrastructure, and much more,
- As an SRE you’ll have the chance to drive, develop, and lead projects in any of the following areas:
- Data Platform (NetDL) Architecture and Performance,
- Capacity Planning,
- Autoscaling,
- Disaster Recovery,
- Observability,
- Change Management - CI/CD,
- Service Network Architecture,
- Cost Optimizations,
- Infrastructure and Cloud-First Application Security,
- You will also be joining a globally distributed, “follow the sun” on-call team where you’ll:
- Continuously improve operational processes by adding automation,
- Lead sustainable incident response and blameless postmortems