[Remote] Staff Site Reliability Operations Engineer
Note: This is a remote position open to candidates in the USA. Calix is a company focused on enabling Communication Service Providers to transform and future-proof their businesses through a cloud-first, AI-powered platform. They are seeking a Staff Site Reliability Engineer to lead their global platform reliability and observability strategy on Google Cloud Platform, leveraging advanced technologies to build intelligent infrastructure and provide technical leadership.
Responsibilities
- Full-Stack Network Architecture: Architect, optimize, and troubleshoot complex networking infrastructure spanning Layer 1 through Layer 7, ensuring low-latency data transport, secure edge routing, and seamless service mesh integration
- Grafana Stack Architecture: Design, scale, and optimize our unified observability platform using the Grafana Labs suite (Grafana, Mimir, Loki, Tempo, and Beyla)
- AIOps & Intelligent Alerting: Deploy machine learning models and automated anomaly detection to cut through telemetry noise, reduce alert fatigue, and predict network or data pipeline bottlenecks
- GKE Platform Engineering: Drive the architecture, scaling, security, and networking of production Google Kubernetes Engine (GKE) clusters
- Data & Event Streaming Reliability: Tune and maintain high-throughput Apache Kafka clusters to guarantee low-latency event delivery and high availability
- Large-Scale Database Management: Ensure the performance, scalability, and disaster recovery readiness of our transactional and analytical data tiers across PostgreSQL, AlloyDB, and BigQuery
- Automated Incident Response: Integrate AIOps insights with Grafana workflows to automate triage, accelerate root-cause analysis, and trigger auto-remediation scripts
- Technical Leadership: Champion the long-term technical roadmap for distributed infrastructure engineering and GCP cloud-native observability standards
- Mentorship: Coach senior and junior engineers on advanced debugging techniques, distributed systems thinking, and intelligent operations across a distributed workforce
Skills
- Proven track record of high autonomy and successful delivery in a 100% remote engineering environment
- 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles
- Deep technical knowledge and debugging mastery across all OSI layers, including:
  - L1-L3: Physical/fiber infrastructure awareness, switching, and advanced routing protocols (BGP, OSPF)
  - L4: Transport layer tuning (TCP congestion control algorithms, UDP, QUIC)
  - L5-L7: Session management, TLS termination, DNS architecture, and advanced application protocols (HTTP/3, gRPC)
- Expert-level mastery of Google Kubernetes Engine (GKE) internals, custom controllers, multi-cluster networking, and GitOps workflows
- Proven track record managing high-throughput Apache Kafka pipelines and large-scale data environments across PostgreSQL, AlloyDB, and BigQuery
- Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale
- Track record applying AI/ML techniques for time-series anomaly detection, log clustering, and correlation (e.g., Grafana Adaptive Metrics, BigPanda)
- Advanced, production-scale expertise utilizing HashiCorp Terraform exclusively to provision and manage multi-region GCP cloud architectures
- High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts
- Exceptional written and verbal communication skills, with an emphasis on creating clear documentation for asynchronous alignment
- Deep knowledge of Google Cloud architectural best practices, Cloud SDN, Cloud Armor, Interconnect, Identity and Access Management (IAM), and cost optimization
- Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump)
Benefits
- As a part of the total compensation package, this role may be eligible for a bonus.