Troubleshoot high database load with AI assistance

This document describes how you can use AI assistance in Cloud SQL to troubleshoot high database load in Cloud SQL. You can use the AI assistance capabilities of Cloud SQL and Gemini Cloud Assist to investigate, analyze, obtain recommendations, and finally implement those recommendations to optimize your queries in Cloud SQL.

By accessing the Query insights dashboard in the Google Cloud console, you can analyze your database and troubleshoot events when your system experiences a higher database load than average. Cloud SQL uses the 24 hours of data that occurs prior to your selected time range to calculate the expected load of your database. You can look into the reasons for the higher load events and analyze the evidence behind reduced performance. Finally, Cloud SQL provides recommendations for optimizing your database to improve performance.

Before you begin

To troubleshoot high database load with AI assistance, do the following:

  1. Review limitations with AI-assisted troubleshooting.
  2. Enable AI-assisted troubleshooting.

Required roles and permissions

For the roles and permissions required to troubleshoot high database load with AI assistance, see Observe and troubleshoot with AI.

Use AI assistance

To use AI assistance with troubleshooting high database load, go to the Instance Overview page or the Query insights dashboard in the Google Cloud console.

Instance overview page

Troubleshoot high database load with AI assistance in the Instance overview page by using the following steps:

  1. In the Google Cloud console, go to the Cloud SQL Instances page.

    Go to Cloud SQL Instances

  2. To open the Overview page of an instance, click the instance name.
  3. In the Overview page, from the Chart menu, select a metric for the database. You can select any metric.
  4. Optional: To select a specific analysis time period, use the Time range filter to select either 1 hour, 6 hours, 1 day, 7 days, 30 days or a custom range .
  5. Database performance chart on the Overview page that shows
    CPU utilization over a 24 hour time period and an option to
    analyze instance performance.

    You can zoom in to specific sections of the chart where you notice areas of high load that you want to analyze. For example, an area of high load might display CPU utilization levels closer to 100%. To zoom in, click and select a portion of the chart.

    A graph where the user has clicked a data point to zoom in. An option to
    reset the zoomed graph appears.
  6. Click Analyze instance performance to start troubleshooting high database load with AI assistance. This generates the Analyzing database load page.

Query insights dashboard

Troubleshoot high database load with AI assistance in the Query insights dashboard using the following steps:

  1. In the Google Cloud console, go to the Cloud SQL Instances page.

    Go to Cloud SQL Instances

  2. To open the Overview page of an instance, click the instance name.
  3. Click Query insights to open the Query insights dashboard.
  4. Optional: Use the Time range filter to select either 1 hour, 6 hours, 1 day, 7 days, 30 days or a custom range.
  5. Database Load chart on the Query insights chart that shows query
    latency over a 24 hour time period and an option to analyze instance performance.

    You can zoom in to specific sections of the chart where you notice areas of higher database load by query execution time. To zoom in, click and select a portion of the chart.

  6. In the Database load chart, click Analyze instance performance to start troubleshooting high database load with AI assistance. This generates the Analyzing database load page.

Analyze high database load

Using AI assistance, you can analyze and troubleshoot the details of your database load.

In the Analyzing database load page, you can view following details for your Cloud SQL instance:

  • Analysis time period
  • CPU utilization (p99)
  • Memory utilization (p99)

Cloud SQL displays a Data Transfer In/Out bytes chart where you can look at data transfer activity for the selected time period. You can check for sudden spikes in data transfer activity for a particular time period.

Analyzing database performance page for Cloud SQL for SQL Server

Analysis time period

Cloud SQL analyzes your database for the time period that you select in your database load chart from the Query insights dashboard or the Instance overview page. If you select a time period of less than 24 hours, then Cloud SQL analyzes the entire time period. If you select a time period greater than 24 hours, then Cloud SQL selects only the last 24 hours of the time period for analysis.

To calculate the baseline performance analysis of your database, Cloud SQL includes 24 hours of a baseline time period in its analysis time period. If your selected time period occurs on a day other than Monday, then Cloud SQL uses a baseline time period of the 24 hours previous to your selected time period. If your selected time period occurs on a Monday, then Cloud SQL uses a baseline time period of the 7th day previous to your selected time period.

Situation

When Cloud SQL starts the analysis, Cloud SQL checks for significant changes in the following key metrics:

  • Queries per second (QPS)
  • CPU
  • Memory
  • Disk I/O

Cloud SQL compares the baseline aggregated data for your database within the performance data of your analysis time window. If Cloud SQL detects a significant change in threshold for a key metric, then Cloud SQL indicates a possible situation with your database. The identified situation might explain a root cause for the high load on your database over the selected time period.

For example, you might see multiple situations identified for why your database is experiencing high load such as:

  • Change in database load
  • Hardware resource contention
  • TempDB contention

Situation and evidence for Cloud SQL for SQL Server

Evidence

For each situation, Cloud SQL provides a list of evidence to support the finding. Cloud SQL bases evidence on metrics gathered from the instance.

Each situation has supporting evidence that's used to detect anomalies in system performance. Cloud SQL detects an anomaly when system performance surpasses certain thresholds or meets specific time-bound criteria. Cloud SQL defines these thresholds or criteria for each situation.

To support the situation of Hardware resource contention, you might see the following evidence:

  • CPU utilization: Consistently high CPU utilization detected.
  • Server level waits: High CPU server waits exists.

To view the evidence retrieved during analysis, click each situation. The evidence appears in the pane next to its corresponding situation.

Recommendations

Based on all of the situations analyzed, Cloud SQL provides you with one or more actionable recommendations to help remediate the problems of your high database load. Cloud SQL presents the recommendations with a cost-benefit analysis so you can make an informed decision on whether to implement the recommendation.

For some situations, based on the analysis, there might not a recommendation.

Table of recommendations

For example, you might receive the following recommendations:

  • Query and workload tuning: Tune your workload and query, and adjust maximum worker threads based on best practices.
  • Database maintenance: Ensure that statistics are being updated regularly as part of any database maintenance. Along with this, regularly tune top queries by CPU or add more CPU if needed.

For some recommendations, you can click the Learn more link to read supporting documentation for the recommendation.

If you want to continue troubleshooting or get more assistance with system performance, then you can also open Gemini Cloud Assist.

What's next