
Business Analyst IV - Alert Management & Observability Standards Lead

Work from home · Full-time role · Hiring

What this Job Entails:

The Business Analyst IV will provide solutions that help attain business outcomes. The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes-on-Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high-quality actions. This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.

Your Roles and Responsibilities:

1) Alert Rationalization & Prioritization (Core)

Establish and maintain a department-wide alert rationalization framework that evaluates alerts for:

  • Business/service criticality and operational priority
  • Actionability (clear operator action available)
  • Signal-to-noise (duplicate/low-value alerts removed or suppressed)
  • Ownership and escalation paths

Perform regular alert reviews (new + existing) to ensure alert quality, correct routing, and alignment with operational coverage. Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents and high-impact degradation.

2) Standards, Policies, and Guardrails

Define and enforce alerting standards, including:

  • Severity definitions and thresholds
  • Required metadata (service, CI, owner, runbook link, escalation)
  • Naming conventions and tagging taxonomy
  • Routing rules and “when to page vs. when to ticket”

Create a standardized Alert Design Checklist and approval workflow (e.g., “Definition of Done” for alert onboarding). Partner with tool/platform owners to ensure standards are embedded in monitoring tooling (templates, required fields, automated validation).

3) Routing Decisions to 24x7 Eyes-on-Glass

Act as gatekeeper (or lead the governance process) for determining which alerts should:

  • Go to 24x7 Eyes-on-Glass for immediate triage
  • Route to on-call engineering directly
  • Create tickets for business-hours handling
  • Be suppressed, aggregated, or converted to dashboards/health indicators

Ensure routing aligns with:

  • Operational responsibilities and skills of the Eyes-on-Glass team
  • Department priorities (e.g., safety, reliability, customer impact)
  • Service ownership and support models

4) Runbook / Response Instruction Cataloging (Knowledge System)

Establish a consistent approach to cataloging response instructions for every actionable alert, including:

  • “What does this alert mean?” (symptoms + impact)
  • “What to check first” (triage steps)
  • “What actions to take” (standard remediation)
  • “When to escalate and to whom” (clear escalation triggers)
  • Links to dashboards, logs, SOPs, and known issues

Own the runbook template and ensure runbooks are versioned, maintained, and reviewed on a defined cadence. Partner with service owners to ensure runbooks stay current as systems change.

5) Reporting & Operational Outcomes

Define and publish KPIs that demonstrate alerting health and operational performance, such as:

  • Alert volume trends by service and severity
  • Percentage of alerts with runbooks and valid ownership
  • Alert “actionability rate” and noise reduction
  • Mean time to acknowledge / triage effectiveness (as applicable)

Facilitate governance forums (weekly/monthly) with service owners and engineering leads to review alert quality and backlog.

6) Cross-Functional Enablement

  • Coach service teams on best practices: SLIs/SLOs, alert thresholds, dependency monitoring, and incident correlation
  • Drive adoption of observability patterns (golden signals, health indicators, multi-signal alerting)
  • Support major incident learning by feeding post-incident insights back into improved alerts and runbooks

7) Able to deliver the following in the first 45 days:

  • Alerting standards (severity model, metadata, naming, routing policy) published and adopted
  • Intake and approval workflow established for new/changed alerts
  • Top 20 noisy services rationalized (dedupe/suppress/threshold tuning) with measurable noise reduction
  • Runbook template launched; minimum runbook coverage targets set (e.g., 80% of paged alerts)
  • Central alert catalog created (ownership + routing + runbook link + last review date)

Required Qualifications/Skills:

  • 5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management
  • Demonstrated success reducing noise and improving actionability across enterprise alerting ecosystems
  • Experience with common monitoring/observability tools (e.g., Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Mgmt or similar)
  • Strong understanding of:
      • Incident response workflows and operational coverage models (24x7 vs. business hours)
      • CMDB/service ownership concepts and dependency mapping
      • Standard operating procedures/runbooks and knowledge management
  • Excellent stakeholder management and ability to drive standards across teams

Preferred Qualifications:

  • Experience designing or operating an Operations Command Center / NOC / SOC-style “eyes-on-glass” model
  • Familiarity with ITIL Event Management, SRE principles, and service reliability practices
  • Experience with automation for alert enrichment, correlation, and routing (e.g., event correlation, deduplication, noise suppression)
  • Background in governance frameworks and operating rhythm design (cadences, controls, compliance traceability)

Physical Demand & Work Environment:

  • Must have the ability to perform office-related tasks which may include prolonged sitting or standing
  • Must have the ability to move from place to place within an office environment
  • Must be able to use a computer
  • Must have the ability to communicate effectively
  • Some positions may require occasional repetitive motion or movements of the wrists, hands, and/or fingers

