Incident Management Analytics

Overview

Use Incident Analytics to learn from past incidents and understand the efficiency and performance of your incident response process. Incident Analytics lets you pull aggregated statistics on your incidents over time. You can use these statistics to create reports that help you:

  • Analyze whether your incident response process is improving over time
  • Assess your mean time to resolution
  • Identify areas of improvement that you should invest in

Data collected

Incident Management Analytics is a queryable data source for aggregated incident statistics. You can query these analytics in a variety of graph widgets in both Dashboards and Notebooks to analyze the history of your incident response over time. To give you a starting point, Datadog provides out-of-the-box resources that you can clone and customize, such as the Incident Report Notebook template.

Incident timestamps

Incidents carry three timestamp attributes that influence analytics:

  • Declaration time (declared): When the incident was declared.
  • Detection time (detected): When the underlying resource from which the incident was declared was created. For example, if a monitor alert fires at 2 p.m. and the incident is declared from it at 2:30 p.m., the detected time is 2 p.m. If the incident wasn’t declared from another Datadog resource, detected is the same as declared.
  • Resolution time (resolved): When the incident was most recently resolved.
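The fallback rule for the detection time can be illustrated with a minimal sketch. The function name `detected_time` and the parameter `source_created` are hypothetical and are not part of any Datadog API; they only model the rule described above.

```python
from datetime import datetime
from typing import Optional


def detected_time(declared: datetime, source_created: Optional[datetime] = None) -> datetime:
    """Return the detection time for an incident (illustrative model only).

    source_created is the creation time of the resource (for example, a
    monitor alert) that the incident was declared from, or None if the
    incident was not declared from another resource.
    """
    # Hypothetical rule taken from the text: use the source resource's
    # creation time when there is one, otherwise fall back to declared.
    return source_created if source_created is not None else declared


alert_fired = datetime(2024, 1, 15, 14, 0)   # monitor alert fires at 2 p.m.
declared = datetime(2024, 1, 15, 14, 30)     # incident declared at 2:30 p.m.

from_monitor = detected_time(declared, alert_fired)  # equals alert_fired
standalone = detected_time(declared)                 # equals declared
```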

Measures

Incident Management reports the following analytic measures, which you can use to power analytic queries in Dashboard and Notebook widgets:

  • Customer Impact Duration: The duration during which customers were impacted, based on the impacts defined on the incident.
  • Status Active Duration: The duration that the incident was in an “active” state, based on the incident timeline.
  • Status Stable Duration: The duration that the incident was in a “stable” state, based on the incident timeline.
  • Time to Detect: The duration from the earliest customer impact to the incident’s detection time.
  • Time to Repair: The duration from the incident’s detection time to the last customer impact.
  • Time to Resolve: The duration from the incident’s declaration time to the time it was resolved.
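As a rough illustration of the status duration measures, the sketch below sums the time spent in each state from a list of status changes. The timeline format and the assumption that a status lasts until the next status change are hypothetical; the source does not specify how Datadog derives these values from the incident timeline.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical timeline format: (time of status change, new status).
Timeline = List[Tuple[datetime, str]]


def status_durations(timeline: Timeline, end: datetime) -> Dict[str, timedelta]:
    """Sum the time spent in each status, assuming each status lasts
    until the next status change (or until `end` for the last entry)."""
    totals: Dict[str, timedelta] = {}
    for i, (start, status) in enumerate(timeline):
        stop = timeline[i + 1][0] if i + 1 < len(timeline) else end
        totals[status] = totals.get(status, timedelta()) + (stop - start)
    return totals


timeline = [
    (datetime(2024, 1, 15, 14, 30), "active"),
    (datetime(2024, 1, 15, 15, 0), "stable"),
    (datetime(2024, 1, 15, 15, 20), "active"),
    (datetime(2024, 1, 15, 15, 45), "resolved"),
]
durations = status_durations(timeline, end=datetime(2024, 1, 15, 15, 45))
# Active: 30 + 25 minutes; stable: 20 minutes.
```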

In addition to these defaults, you can create new measures by adding custom Number property fields in your Incident Settings.

Note: If you override a timestamp, the override value is used to calculate Time to Detect, Time to Repair, and Time to Resolve.
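The three time-based measures, including the effect of a timestamp override, can be sketched as follows. The `Incident` class, its `overrides` dictionary, and the reading of "last customer impact" as the end of the latest impact window are assumptions made for illustration; they are not a Datadog data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass
class Incident:
    """Hypothetical incident record used only for this sketch."""
    declared: datetime
    detected: datetime
    resolved: datetime
    impacts: List[Tuple[datetime, datetime]]  # (impact start, impact end)
    overrides: Dict[str, datetime] = field(default_factory=dict)

    def timestamp(self, name: str) -> datetime:
        # An override, when present, replaces the recorded timestamp.
        return self.overrides.get(name, getattr(self, name))


def time_measures(incident: Incident) -> Dict[str, timedelta]:
    impact_start = min(start for start, _ in incident.impacts)
    impact_end = max(end for _, end in incident.impacts)  # assumed meaning of "last customer impact"
    detected = incident.timestamp("detected")
    return {
        "time_to_detect": detected - impact_start,
        "time_to_repair": impact_end - detected,
        "time_to_resolve": incident.timestamp("resolved") - incident.timestamp("declared"),
    }


incident = Incident(
    declared=datetime(2024, 1, 15, 14, 30),
    detected=datetime(2024, 1, 15, 14, 0),
    resolved=datetime(2024, 1, 15, 15, 45),
    impacts=[(datetime(2024, 1, 15, 13, 50), datetime(2024, 1, 15, 15, 10))],
)
measures = time_measures(incident)

# Overriding the detected timestamp changes Time to Detect and Time to Repair.
incident.overrides["detected"] = datetime(2024, 1, 15, 13, 55)
overridden = time_measures(incident)
```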

Visualize incident data in dashboards

To configure your graph using Incident Management Analytics data, follow these steps:

  1. Select your visualization.
  2. Select Incidents from the data source dropdown menu.
  3. Select a measure from the yellow dropdown menu.
    • Default Statistic: Counts the number of incidents.
  4. Select an aggregation for the measure.
  5. (Optional) Select a rollup for the measure.
  6. (Optional) Use the search bar to filter the statistic down to a specific subset of incidents.
  7. (Optional) Select a facet in the pink dropdown menu to break the measure up by group and select a limited number of groups to display.
  8. Title the graph.
  9. Save your widget.

Example: Weekly outage customer impact duration grouped by service

Figure: Timeseries graph configuration using the Incidents data source, filtered by severity, showing Customer Impact Duration grouped by service.

This example configuration aggregates incidents with a severity of SEV-1 or SEV-2. The graph displays the average Customer Impact Duration of those incidents, rolled up by week and grouped by service.

  1. Widget: Timeseries Line Graph
  2. Datasource: Incidents
  3. Measure: Customer Impact Duration
  4. Aggregation: avg
  5. Rollup: 1w
  6. Filter: severity:("SEV-1" OR "SEV-2")
  7. Group: Services, limit to top 5
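To make the configuration above concrete, the following sketch reproduces the same filter, weekly rollup, average aggregation, and grouping on a small in-memory data set. It does not call Datadog; the record fields (`day`, `severity`, `service`, `impact_minutes`) are hypothetical, each incident is assumed to have a single service, and ranking the top groups by their highest weekly average is an assumption about how the limit is applied.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Hypothetical sample incidents, not real data.
incidents = [
    {"day": date(2024, 1, 2), "severity": "SEV-1", "service": "checkout", "impact_minutes": 60},
    {"day": date(2024, 1, 4), "severity": "SEV-2", "service": "checkout", "impact_minutes": 30},
    {"day": date(2024, 1, 3), "severity": "SEV-3", "service": "search", "impact_minutes": 500},
    {"day": date(2024, 1, 9), "severity": "SEV-2", "service": "search", "impact_minutes": 20},
]


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (the 1w rollup bucket)."""
    return day - timedelta(days=day.weekday())


def weekly_avg_impact(records, severities=("SEV-1", "SEV-2"), limit=5):
    # Filter: severity:("SEV-1" OR "SEV-2"), then bucket by (week, service).
    buckets = defaultdict(list)
    for rec in records:
        if rec["severity"] in severities:
            buckets[(week_start(rec["day"]), rec["service"])].append(rec["impact_minutes"])

    # Aggregation: avg of Customer Impact Duration per bucket.
    averages = {key: mean(values) for key, values in buckets.items()}

    # Group limit: keep the services with the highest weekly average (assumed ranking).
    peak = defaultdict(float)
    for (_, service), avg in averages.items():
        peak[service] = max(peak[service], avg)
    top = set(sorted(peak, key=peak.get, reverse=True)[:limit])
    return {key: avg for key, avg in averages.items() if key[1] in top}


result = weekly_avg_impact(incidents)
```

Here the SEV-3 incident is excluded by the filter, the two checkout incidents fall into the week starting January 1, 2024, and the search incident falls into the following week.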

Incident report

To get a summary report of incidents for your team or service, start from the out-of-the-box Incident Report Notebook template or build a report from scratch.

  1. Open the Incident Report template.
  2. Click Use Template to edit and customize.
  3. You can use the existing Incident cells or customize the query to display values for each measure.
  4. Update the summary cells with the relevant values and share the report with the rest of your team.
