Site Reliability Engineer

Work from home Full-time role Hiring

Role Overview

This is a key role focused on improving the reliability, availability, performance, and operational maturity of Grubtech's production systems. This individual will manage and improve AWS-based cloud environments, including ECS-based workloads; strengthen monitoring, alerting, logging, and observability capabilities; and support effective incident management for mission-critical workloads. The role will partner closely with application, DevOps, infrastructure, and support teams to prevent incidents, respond quickly when issues occur, improve production readiness, and reduce operational toil through automation and continuous improvement.

Profile:

  • Bachelor’s degree in Computer Science, Software Engineering, or a related field.
  • Minimum 5 years of hands-on experience in Site Reliability Engineering, DevOps, cloud platform engineering, infrastructure operations, or production engineering.

  • Strong hands-on experience operating, troubleshooting, and improving production workloads in AWS; Azure or on-prem deployments would be an added advantage.

  • Experience with core AWS services and production operations, including VPC, EC2, ECS, IAM, Load Balancers, CloudWatch, RDS, Security Groups, and related cloud services.

  • Hands-on working experience with Datadog is a must, including monitoring, alerting, application performance monitoring, logging, dashboards, and service health visibility.

  • Ability to continuously improve existing Datadog dashboards, monitors, alert thresholds, and operational views as services evolve and production needs change.

  • Experience managing and improving incident management capabilities, including incident triage, escalation, communication, root-cause analysis, post-incident reviews, and follow-up actions.

  • Experience defining and improving reliability practices such as SLOs, SLIs, error budgets, runbooks, playbooks, operational readiness checks, and on-call processes.

  • Experience troubleshooting distributed systems, AWS infrastructure, ECS workloads, networking, databases, and application performance issues in production environments.

  • Experience with multiple scripting languages such as Python, Bash, PowerShell, and JavaScript.
  • Experience with managed data platforms such as MongoDB Atlas, Confluent Cloud, Couchbase, PlanetScale, ClickHouse, Redis, and Postgres.

  • Experience supporting mission-critical Linux systems at scale; Windows experience is optional but good to have.

  • Experience supporting cloud networking, including DNS, Web Application Firewalls, Security Groups, Network Access Control Lists, and load balancers.

  • Experience supporting containerized workloads using Docker and AWS ECS.
  • Expertise with cloud monitoring and management systems.
  • Experience with cloud security principles and best practices.
  • Familiarity with GitHub and GitHub Actions for managing CI/CD pipelines, release workflows, and deployment automation.

  • Experience with monitoring and management tools such as Datadog, Prometheus, Grafana, and ELK.

  • Ability to analyze current technology and operational processes, then develop practical steps to improve reliability, alert quality, scalability, and operational efficiency.

  • Willingness to participate in incident response and on-call support for production systems when required.

  • Strong problem-solving and analytical skills.
  • Strong English communication skills.
  • Ability to multitask, work well under pressure, and prioritize work against competing deadlines and changing business priorities.
