Dynatrace AI 2.0: Root Cause Analysis
Agenda
• Artificial Intelligence
• AI 2.0
• Problem Analysis
• Anomaly Detection
• Alerting Profiles
• Problem Notifications
• Maintenance Windows
• Dashboards
• View/Modify/Create/Delete Dashboards
• Dashboard Management
Confidential 2
Artificial Intelligence
Smart Baselines & Problems
Intelligent anomaly detection
Fully automatic multidimensional baselining
• Multidimensional baselining
• Dynatrace monitors the median and the 90th percentile (the slowest 10% of all callers)
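As a rough illustration of why both values are tracked, the following sketch computes the median and the 90th percentile of a batch of response times with Python's standard library. The function name and the sample data are made up for this example, and the interpolation method is an assumption; the slides do not say how Dynatrace computes its percentiles.

```python
import statistics

def median_and_p90(response_times_ms):
    """Return (median, 90th percentile) for a list of response times."""
    ordered = sorted(response_times_ms)
    median = statistics.median(ordered)
    # quantiles(n=10) returns the nine decile cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(ordered, n=10, method="inclusive")[-1]
    return median, p90

# Hypothetical sample: nine fast requests and one very slow one.
samples = [100, 110, 120, 130, 140, 150, 160, 170, 180, 1000]
median, p90 = median_and_p90(samples)
print(median, p90)  # 145.0 262.0
```

The median barely reacts to the single slow request, while the 90th percentile moves noticeably, which is why the slowest 10% of callers are watched separately.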
Anomaly Detection
• Dynatrace continuously measures incoming traffic levels against defined thresholds to determine
when a detected slowdown or error-rate increase justifies the generation of a new problem event
• Rapidly increasing response-time degradations for applications and services are evaluated based on
sliding 5-minute time intervals
• Slowly developing response-time degradations are evaluated based on 15-minute time intervals
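The two interval lengths above can be sketched as a pair of sliding windows over timestamped response times. This is only a minimal sketch: the reference value, the two factors, and all function names are assumptions chosen for illustration, not Dynatrace's actual detection logic.

```python
from statistics import median

def window_median(samples, now_s, window_s):
    """Median response time of samples with a timestamp in (now_s - window_s, now_s]."""
    values = [rt for ts, rt in samples if now_s - window_s < ts <= now_s]
    return median(values) if values else None

def detect_degradation(samples, now_s, reference_ms, fast_factor=2.0, slow_factor=1.3):
    """Return 'rapid', 'slow', or None. The factors are assumed, illustrative values."""
    fast = window_median(samples, now_s, 5 * 60)    # sliding 5-minute interval
    slow = window_median(samples, now_s, 15 * 60)   # sliding 15-minute interval
    if fast is not None and fast > reference_ms * fast_factor:
        return "rapid"
    if slow is not None and slow > reference_ms * slow_factor:
        return "slow"
    return None

# One sample per minute: a sudden jump in the last five minutes.
jump = [(t * 60, 100) for t in range(1, 11)] + [(t * 60, 300) for t in range(11, 16)]
# A mild but sustained slowdown over the whole 15 minutes.
creep = [(t * 60, 140) for t in range(1, 16)]
print(detect_degradation(jump, 900, reference_ms=100))   # rapid
print(detect_degradation(creep, 900, reference_ms=100))  # slow
```

The short window reacts quickly to large jumps, and the longer window catches smaller deviations that persist.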
Automatic detection of topology and communication relation
Intelligent correlation of events and degradations
Causation is key to detect the root cause of problems!
• The ranking determines the probability of each root-cause candidate, shown on the right.
Topology information is key to scalability
• A scalable and generic approach that is able to handle huge problem sizes, as shown below.
Frequent issue detection
• Hypothesis: There are unhealthy situations that are normal for ops
• Unimportant disk has been full for several weeks
• Regular backup process triggers CPU spikes
• Dynatrace detects such regular events
• On the basis of a daily and weekly moving window
• Once discovered, only notify the user if severity increases
• One week without the frequent issue means reset to start
[Diagram: Healthy and Unhealthy states, classified as Normal or Abnormal; transition labels: good, …?, frequent?, bad]
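The notification rules above can be sketched as a small state tracker. The slides do not say how many occurrences make an issue "frequent", so `min_occurrences` is an assumed value, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field

WEEK_S = 7 * 24 * 3600

@dataclass
class FrequentIssueTracker:
    min_occurrences: int = 3                          # assumed threshold, not from the slides
    occurrences: list = field(default_factory=list)   # timestamps inside the weekly window
    frequent: bool = False
    max_severity: int = 0

    def observe(self, ts, severity):
        """Record one occurrence; return True if the user should be notified."""
        # One week without the issue resets the tracker to its start state.
        if self.occurrences and ts - self.occurrences[-1] >= WEEK_S:
            self.occurrences, self.frequent, self.max_severity = [], False, 0
        # Keep only occurrences inside the weekly moving window, then add this one.
        self.occurrences = [t for t in self.occurrences if ts - t < WEEK_S] + [ts]
        was_frequent = self.frequent
        if len(self.occurrences) >= self.min_occurrences:
            self.frequent = True
        # Once an issue is known to be frequent, notify only when severity increases.
        notify = (not was_frequent) or severity > self.max_severity
        self.max_severity = max(self.max_severity, severity)
        return notify

DAY = 24 * 3600
tracker = FrequentIssueTracker()
daily = [tracker.observe(d * DAY, severity=1) for d in range(4)]
print(daily)  # [True, True, True, False]
```

In this sketch the fourth daily occurrence is suppressed because the issue has already been classified as frequent and its severity has not increased.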
Vision and outlook: Automatic metric change point detection for causation
• Detect all topology-relevant metrics and automatically detect their change points
Artificial Intelligence in Dynatrace
• https://www.dynatrace.com/news/blog/artificial-intelligence-dynatrace/
AI 2.0
• Every Dynatrace environment can opt in to our second-generation AI root-cause
analysis
Dynatrace AI 1.0
What’s our solution?
• OneAgent collects more than 500 different metrics on average (CPU, memory, network,
process metrics and service metrics)
• The health state of each monitored component is derived from those metrics either by
comparing the baseline to the current value or by checking a threshold
• The 2nd generation root-cause engine introduces a completely different approach that
no longer depends on a baseline or a threshold to detect the root causes in
complex situations
Dynatrace AI 2.0
• AI presents all findings within the new root-cause section
Dynatrace AI 2.0 – Detect root-causes in custom metrics and events
• Plugins for third-party integrations can represent a great resource for additional root-cause
information
• An example here is the tight integration into your continuous integration and deployment toolchain
that provides information about recent rollouts, responsible product owners and possible
remediation actions
• The new analysis covers both kinds of ingested information: custom metrics as well as custom events
sent from third-party integrations
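As an illustration of the kind of information a deployment toolchain could send, the sketch below builds a deployment event as a plain dictionary and serializes it to JSON. The field names follow the Dynatrace Events API v1 as best recalled and should be treated as assumptions to verify against the current API documentation; the function name, tag, and values are hypothetical, and nothing is sent over the network.

```python
import json

def build_deployment_event(service_tag, name, version, owner, remediation, source="ci-pipeline"):
    """Build a deployment event payload (field names are assumptions, see lead-in)."""
    return {
        "eventType": "CUSTOM_DEPLOYMENT",
        "source": source,
        "deploymentName": name,
        "deploymentVersion": version,
        "remediationAction": remediation,
        # Free-form properties, e.g. the responsible product owner.
        "customProperties": {"productOwner": owner},
        # Attach the event to all services that carry the given tag.
        "attachRules": {
            "tagRule": [
                {"meTypes": ["SERVICE"], "tags": [{"context": "CONTEXTLESS", "key": service_tag}]}
            ]
        },
    }

event = build_deployment_event(
    service_tag="checkout",
    name="Checkout rollout",
    version="1.4.2",
    owner="Jane Doe",
    remediation="Roll back to the previous version",
)
payload = json.dumps(event, indent=2)
print(payload)
```

A CI/CD job would post such a payload to the environment's events endpoint with an API token. That step is left out here on purpose.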
Dynatrace AI 2.0 – Smarter and more precise root-cause findings
• Another improvement within the new analysis is the detection of grouped root-causes
• The new analysis identifies root-cause candidates on group level to explain the overall situation,
such as a set of outliers within a large cluster of service instances
• While the problem details screen just shows a quick summary of the top contributors, a click on the
‘Analyze findings’ drill-down button opens a detailed analysis view
• This drill down view is organized to show an identified root-cause as a grouped vertical stack,
meaning the top layer always shows service findings followed by process group findings and finally
all host and infrastructure findings
Problem Analysis
What is a problem?
• A “problem” in Dynatrace includes the AI-driven analysis, environmental context, root cause analysis,
and other details provided for one or more incidents in your environment
Problem pattern catalog
• Applications
• 1.1 Unexpected high traffic
• 1.2 Unexpected low traffic
• 1.3 User action duration degradation
• 1.4 JavaScript error rate increase
• Services
• 2.1 Response time degradation
• 2.2 Failure rate increase
• 2.3 Failed database connects
• Synthetic only
• 3.1 Web check performance threshold violation
• 3.2 Web check availability error
• Infrastructure
• 4.1 CPU saturation
• 4.2 Memory saturation
• 4.3 Slow storage
• 4.3 Insufficient queue depth
• 4.4 Overloaded storage
• 4.5 Network congestion
• 4.6 Host or monitoring unavailable
• 4.7 High network utilization
• 4.8 Multiple infrastructure problems
Events
• You’ll find events listed in the Events section on each individual host, ESXi host and service page
• The deployment of new software code is one type of event tracked by Dynatrace
• Such deployment events sometimes result in performance issues
• Dynatrace identifies such issues by tracking all events, including deployment events, and correlating them
with any discovered performance problems
• If we notice a drop in performance immediately following the deployment of new code or a system restart,
we will notify you and provide measurement comparisons of performance both before and after the
deployment
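The before-and-after comparison described above can be sketched as follows. The window length, the use of the median, and the function name are assumptions for illustration; the slides do not specify how Dynatrace computes the comparison.

```python
from statistics import median

def compare_around_event(samples, event_ts, window_s=600):
    """Compare median response time in equal windows before and after an event.

    samples is a list of (timestamp_s, response_time_ms) pairs.
    """
    before = [v for ts, v in samples if event_ts - window_s <= ts < event_ts]
    after = [v for ts, v in samples if event_ts <= ts < event_ts + window_s]
    if not before or not after or median(before) == 0:
        return None  # not enough data for a meaningful comparison
    b, a = median(before), median(after)
    return {"before_ms": b, "after_ms": a, "change_pct": round((a - b) / b * 100, 1)}

# Hypothetical data: response time rises from 100 ms to 150 ms after a deployment at t=600.
samples = [(t, 100) for t in range(0, 600, 60)] + [(t, 150) for t in range(600, 1200, 60)]
result = compare_around_event(samples, event_ts=600)
print(result)  # {'before_ms': 100.0, 'after_ms': 150.0, 'change_pct': 50.0}
```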
Problem Example
• Following is an example scenario involving a problem that has as its root cause a performance
incident in the infrastructure layer:
• An infrastructure-level performance incident is detected. A new problem is created for tracking purposes
• After a few minutes the infrastructure problem leads to the appearance of a performance degradation
problem in one of the application’s services
• Additional service-level performance degradation problems begin to appear. So what began as an isolated
infrastructure-only problem has grown into a series of service-level problems
• Eventually the service-level problems begin to affect the user experience of your customers. At this point in
the problem life cycle you have an application problem with one root cause in the infrastructure layer and
additional root causes in the service layer.
• Because Dynatrace understands all the dependencies in your environment it correlates the performance
degradation problem your customers are experiencing with the original performance problem in the
infrastructure layer, thereby facilitating quick problem resolution
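The scenario above can be sketched as a walk over a dependency graph: starting from the affected application, follow unhealthy dependencies downward until reaching unhealthy entities whose own dependencies are all healthy. This is a deliberately simplified illustration with hypothetical entity names, not Dynatrace's actual correlation algorithm.

```python
def find_root_causes(dependencies, unhealthy, start):
    """Return unhealthy entities reachable from start that have no unhealthy dependencies."""
    roots, seen, stack = [], set(), [start]
    while stack:
        entity = stack.pop()
        if entity in seen:
            continue
        seen.add(entity)
        unhealthy_children = [c for c in dependencies.get(entity, []) if c in unhealthy]
        if entity in unhealthy and not unhealthy_children:
            roots.append(entity)  # nothing below explains the issue, so it is a candidate
        stack.extend(unhealthy_children)
    return sorted(roots)

# Hypothetical topology: application -> services -> hosts.
dependencies = {
    "web-app": ["checkout-service", "search-service"],
    "checkout-service": ["payment-service"],
    "payment-service": ["host-1"],
    "search-service": ["host-2"],
}
unhealthy = {"web-app", "checkout-service", "payment-service", "host-1"}
print(find_root_causes(dependencies, unhealthy, "web-app"))  # ['host-1']
```

The healthy `search-service` branch is never explored, which hints at why topology information keeps the analysis tractable in large environments.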
Problems
• Such root cause analysis is available for every problem that impacts more than one entity
• Just view the “Root Cause” section on any problem detail page to see the analysis that
Dynatrace has performed on the problem
How do I access Problems?
• From the main menu
• Or as a dashboard tile
Problems
• Select a specific time range if multiple problems are found within the dashboard timeframe
• Open problems are marked in red and resolved problems are shown in grey
Measure stability improvements
• There are some limitations when viewing problems that are older than 10 days
• This means that you must perform all detailed problem analysis and triage within 10 days of
problem detection
Problem Filters
• Severity
• Status
• Impact
• Alerting Profiles
• Context
• Tags
Problem Filters: Impact
• Infrastructure
• This level comprises the physical and/or virtual machines that serve your application to your customers
• This level includes the servers, databases, hosts, and processes running in your environment
Aspects of a Problem
• Overview
• Business Impact Analysis
• Impact
• Root cause
• Visual resolution path
• Comments
Overview
Business Impact Analysis
Impact
Root Cause
Root Cause Drilldown
Visual resolution path
• The Visual resolution path shows you the dependencies between your application and the
underlying services and infrastructure components that support it
Problem Evolution Viewer
• Each Visual resolution path page includes a Problem evolution viewer that you can use to replay
the problem to see how it evolved over time
• Here you can see in great detail how your application’s dependencies interacted and performed
during the time leading up to and during the problem
• You can see which failed service calls or infrastructure health issues led to the failure of other
service calls and ultimately led to the performance problem that affects your customers’ experience
Comments
Close Problem – With Reason
Problems with logs
• Dynatrace captures preconfigured log messages and correlates them with any problems that it
detects in your environment
• Relevant log messages that are associated with problems are then factored into problem root-cause
analysis
Problems with Web Checks
Analysis – Hands-on
Anomaly Detection
Thresholds
Automatic baselining
• Service baselining calculates a reference value for the Service method dimension:
• Service method: A service’s individual service methods
• Dynatrace application traffic anomaly detection is based on the assumption that most business
traffic follows predictable daily and weekly traffic patterns
• Alerting on traffic spikes and drops begins after a learning period of one week because baselining
requires a full week’s worth of traffic to learn daily and weekly patterns.
• Following the learning period, Dynatrace forecasts the next week’s traffic and then compares the
actual incoming application traffic with the prediction
• If Dynatrace detects a deviation from forecasted traffic levels that falls outside of reasonable
statistical variation, Dynatrace raises either an Unexpected low traffic or an Unexpected high traffic
problem
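The forecast-and-compare idea can be sketched with a deliberately naive seasonal forecast: the expected traffic for an hour-of-week slot is simply last week's value for the same slot. The tolerance band and all names are assumptions for this illustration; the slides do not describe Dynatrace's actual forecasting model or its statistical bounds.

```python
HOURS_PER_WEEK = 7 * 24

def classify_traffic(last_week, actual, slot, tolerance=0.5):
    """Compare actual traffic for one hour-of-week slot with a naive forecast.

    last_week holds one request count per hour of the learning week (168 values).
    tolerance is an assumed stand-in for "reasonable statistical variation".
    """
    if len(last_week) != HOURS_PER_WEEK:
        raise ValueError("a full week of hourly traffic is required")
    forecast = last_week[slot]  # same hour, one week earlier
    if actual > forecast * (1 + tolerance):
        return "Unexpected high traffic"
    if actual < forecast * (1 - tolerance):
        return "Unexpected low traffic"
    return None

last_week = [1000] * HOURS_PER_WEEK
print(classify_traffic(last_week, actual=1600, slot=10))  # Unexpected high traffic
print(classify_traffic(last_week, actual=400, slot=10))   # Unexpected low traffic
print(classify_traffic(last_week, actual=1100, slot=10))  # None
```

The length check mirrors the one-week learning period: without a full week of data there is no slot-by-slot forecast to compare against.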
• Summary:
• Baselines are evaluated within 5-min and 15-min sliding time intervals
• Automatic detection of reference values for response times, error rates and load
• A combination of 4 dimensions for applications and 1 dimension for services
• Baseline cube calculation is initially performed 2 hours after your application or service is first detected by
Dynatrace, and thereafter on a daily basis
• Applications and services have to run for at least 20% of a week before slowdown and error rate alerts are
raised
• Applications have to run for at least a full week before traffic spike and drop alerts are raised
• Slowdown events are detected for the median and 90th percentile
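The two learning-period rules in the summary translate into simple arithmetic: 20% of a week is 0.2 × 168 = 33.6 hours, and a full week is 168 hours. The sketch below encodes just those two rules; the function and key names are hypothetical.

```python
WEEK_HOURS = 7 * 24  # 168 hours

def alerts_enabled(observed_hours, is_application):
    """Return which alert types are eligible after a given number of observed hours."""
    # Slowdown and error-rate alerts need at least 20% of a week (33.6 hours).
    slowdown_and_error = observed_hours >= 0.2 * WEEK_HOURS
    # Traffic spike and drop alerts apply to applications only and need a full week.
    traffic = is_application and observed_hours >= WEEK_HOURS
    return {"slowdown_and_error_rate": slowdown_and_error, "traffic_spike_and_drop": traffic}

print(alerts_enabled(34, is_application=True))
print(alerts_enabled(168, is_application=True))
```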
Static Thresholds
• Built-in
• Dynatrace infrastructure monitoring is based on numerous built-in, predefined static thresholds
• These thresholds relate to resource contentions like CPU spikes, memory, and disk usage
• User-defined
• For applications and services, you can disable automatic baselining-based reference-value detection
anytime and switch to user-defined static thresholds
• If you set a static threshold for response time and error rate on an application or service level, events will be
raised if the static threshold is breached
• A slowdown event is raised if the static thresholds for either the median or the 90th percentile response
times are breached
Settings – Hands-on
Alerting Profiles
• Alerting profiles allow you to control exactly which conditions result in problem notifications and which don’t
• Or:
• Predefined event types
• String-based event filters
Problem Notifications
• Dynatrace offers several out-of-the-box integrations that enable you to automatically push problem
notifications to your preferred third-party incident management or ChatOps service
• Open problems are continuously updated based on evolving impact and correlating events
• To avoid notification spam, problem notifications are only pushed to third-party systems when
problems are initially detected and when they are ultimately resolved
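The push behavior described above can be sketched as a filter over a stream of problem state changes: only the first detection and the final resolution of each problem pass through, while intermediate updates are dropped. The status names and the function name are assumptions for this illustration.

```python
def notifications_to_push(events):
    """Filter (problem_id, status) events down to the ones pushed to third-party systems.

    status is assumed to be one of "OPEN", "UPDATED", or "RESOLVED".
    """
    open_ids = set()
    pushed = []
    for problem_id, status in events:
        if status == "OPEN" and problem_id not in open_ids:
            open_ids.add(problem_id)
            pushed.append((problem_id, "OPEN"))
        elif status == "RESOLVED" and problem_id in open_ids:
            open_ids.discard(problem_id)
            pushed.append((problem_id, "RESOLVED"))
        # "UPDATED" events change the open problem but trigger no notification.
    return pushed

events = [("P-1", "OPEN"), ("P-1", "UPDATED"), ("P-2", "OPEN"),
          ("P-1", "UPDATED"), ("P-1", "RESOLVED"), ("P-2", "RESOLVED")]
print(notifications_to_push(events))
# [('P-1', 'OPEN'), ('P-2', 'OPEN'), ('P-1', 'RESOLVED'), ('P-2', 'RESOLVED')]
```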
Maintenance Windows
• Once a maintenance window is defined, Dynatrace automatically excludes the configured time
period from its baseline calculations
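Excluding a maintenance window from baseline calculations can be pictured as dropping every sample whose timestamp falls inside a configured window before any reference value is computed. The data layout and the function name below are assumptions made for this sketch.

```python
from statistics import median

def samples_for_baseline(samples, maintenance_windows):
    """Drop samples recorded during any maintenance window.

    samples: list of (timestamp_s, value); maintenance_windows: list of (start_s, end_s),
    where each window covers start_s <= timestamp < end_s.
    """
    return [
        (ts, value)
        for ts, value in samples
        if not any(start <= ts < end for start, end in maintenance_windows)
    ]

# Hypothetical data: response times spike during a planned restart between t=300 and t=600.
samples = [(0, 100), (150, 110), (300, 900), (450, 950), (600, 105), (750, 100)]
kept = samples_for_baseline(samples, maintenance_windows=[(300, 600)])
print(kept)                          # [(0, 100), (150, 110), (600, 105), (750, 100)]
print(median(v for _, v in kept))    # 102.5
```

Without the filter, the two samples taken during the restart would pull the reference value upward and make later slowdowns harder to detect.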
• Once you’ve defined your maintenance windows, Dynatrace flags all problems that occur during
maintenance windows with a special maintenance (wrench and bolt) icon
• The Problems page filters now include an Under maintenance filter that enables you to view a list of
problems that occurred during maintenance windows
• If you open a problem that occurred during a maintenance window, Dynatrace shows a header on
the Problem page
• Even if you aren’t within a problem context and you select a global timeframe in which a selected
host was under maintenance, Dynatrace shows you the details on the Maintenance tile
Questions?
Dashboards
• View Dashboards
• Modify Home
• Tiles
• Create a Custom Dashboard
• Custom Charts
• Pin To Dashboard
• Management
Dashboards: Custom Dashboards
Dashboard Exercise
Hands On: Custom Dashboard
• Steps
• Create a new dashboard called “Business Dashboard”
• Create a section called “User Activity” and add the following tiles:
• World Map, Live User Activity and User Behavior
• For World Map, add a second tile with a different metric
• Create a section called “Application Behavior” and add the following tiles:
• Top conversion goals, Bounce rate, Key user actions, and Conversion Goal
• For Goal, select one of your conversion goals
Hands On: Custom Dashboard
• Steps
• Create a new dashboard called “Operations Dashboard”
• Create a section called “Applications” and add the following tiles:
• Application Health (number version), Synthetic monitor health, Application Health (graph version), Resources – Load Time
• Mobile App, Application, Browser monitor
• Create a section called “Services” and add the following tiles:
• Service health, Databases
• Create a section called “Infrastructure” and add the following tiles:
• Host health, Docker, Network Status, Network Metrics
Hands On: Custom Dashboard
• Steps
• Create a new dashboard called “Developers Dashboard”
• Create a section called “Services” and add the following tiles:
• Service health
• Select the top two services and add tiles for Service or request
• Create a section called “Databases” and add the following tiles:
• Select the top 2-4 database services and add tiles for Database performance
Dashboards: Custom Charts
Dashboard Exercise
Hands On: Custom Chart
• Steps
• Create a new dashboard called “Custom Charts”
• Create a section called “Services” and add 4 Custom Chart tiles
• Add the following metrics, one to each chart:
• Services-Requests
• Services-Response Time
• Databases-Requests
• Databases-Response Time
• Edit the title for each chart
Hands On: Custom Chart
• Steps
• Open the dashboard called “Custom Charts”
• Create a section called “Infrastructure” and add 4 Custom Chart tiles
• Add the following metrics, one to each chart:
• Hosts-CPU Usage
• Hosts-Memory Usage
• Hosts-Disk Usage
• Hosts-Network Traffic
• Edit the title for each chart
Dashboards: Pin To Dashboard
Dashboard Exercise
Hands On: Pin to Dashboard
• Steps
• Create a new dashboard called “Pin Tiles”
• Create a section called “Charts” and save the dashboard
• Open the Hosts view and select the Chart icon at the top
• Apply filters to the view and select “Pin to dashboard”
• Do the same for Services and Databases
Hands On: Pin to Dashboard
• Steps
• Open the dashboard called “Pin Tiles”
• Create a section called “Services” and save the dashboard
• Open the Services view and drill into the Service with the highest “Requests”
• Click the “…” button and select “Pin to dashboard”
• Do the same for another Service and for the two Databases with the highest “Requests”
• Click on your first Service tile and drill down to an individual request
• Click the “…” button and select “Pin to dashboard”
Hands On: Custom Dashboard
• Steps
• Create a new dashboard called “My Dashboard - {initials}”
• Use your initials or last name
• Take everything we learned and create a dashboard customized for you or your team
Dashboards: Management
Questions?
dynatrace.com