DBA在日常工作中常常会面临以下两种常见情况:
- 业务人员会提出问题:“表被锁了,导致业务受阻,请帮忙解决。”
- 业务人员还会反馈:“某个程序通常几秒内就能执行完毕,但现在却运行了好几分钟,不清楚是什么原因?”
针对第二种情况,当我们排除了执行计划突然变化或数据量异常增加的可能性后,很大概率也是由锁造成的。
锁查询思路:区分锁类别
根本思路还是排查锁,只是区分排查当前的锁还是排查历史的锁。
下文以实验场景展示上述两个的分析过程。我使用的环境为:
observer (OceanBase 4.3.0.1)
REVISION: 101000062024032200-b59432e535c48e8b8828190c803b6c7736413ff9
BUILD_BRANCH: HEAD
BUILD_TIME: Mar 22 2024 00:52:21
BUILD_FLAGS: RelWithDebInfo
BUILD_INFO:
实验场景演示排查过程
第一个场景:排查当前正在锁的表。
第一步,我们开启一个事务,然后更新一行数据,但是不提交。
第二步,我们模仿程序,通过java代码调用sql,更新同一条数据。
可以看到,现在程序现在卡死的状态。这时我们就可以去库里排查,代码如下。
MySQL [oceanbase]> select * from __all_virtual_processlist where TENANT='htap' and user_client_ip<>'127.0.0.1'; +------------+------+--------+--------------------+------+---------+--------+------+-------+------+--------------+----------+----------+---------------------+---------------+----------------+-----------+----------+-----------+------------+----------+-------------+------------+-----------+------------+--------+--------+-------------+-----------+---------+-----------+---------------------+-------+-------------------+-----------------------+--------+--------+----------+----------+-----------+------------------+------------+--------------+ | id | user | tenant | host | db | command | sql_id | time | state | info | svr_ip | svr_port | sql_port | proxy_sessid | master_sessid | user_client_ip | user_host | trans_id | thread_id | ssl_cipher | trace_id | trans_state | total_time | retry_cnt | retry_info | action | module | client_info | sql_trace | plan_id | tenant_id | effective_tenant_id | level | sample_percentage | record_policy | lb_vid | lb_vip | lb_vport | in_bytes | out_bytes | user_client_port | proxy_user | service_name | +------------+------+--------+--------------------+------+---------+--------+------+-------+------+--------------+----------+----------+---------------------+---------------+----------------+-----------+----------+-----------+------------+----------+-------------+------------+-----------+------------+--------+--------+-------------+-----------+---------+-----------+---------------------+-------+-------------------+-----------------------+--------+--------+----------+----------+-----------+------------------+------------+--------------+ | 3221643380 | SYS | htap | 134.90.8.212:39832 | SYS | Sleep | | 24 | SLEEP | NULL | 134.90.8.212 | 2882 | 2881 | 9681060055801331737 | NULL | 134.90.8.212 | % | 3917058 | 0 | NULL | NULL | ACTIVE | 24 | 0 | 0 | | | | 1 | 3036 | 1002 | 1002 | 1 | 10 | SAMPLE_AND_SLOW_QUERY | NULL | NULL | NULL | 663 | 13527 | 59424 | NULL | NULL | | 3221704604 | SYS | htap | 134.90.8.212:43442 | SYS | Sleep | | 5 | SLEEP | NULL | 134.90.8.212 | 2882 | 2881 | 9681060055801331738 | NULL | 134.84.16.9 | % | 0 | 0 | NULL | NULL | | 5 | 6 | -6005 | | | | 1 | 3037 | 1002 | 1002 | 1 | 10 | SAMPLE_AND_SLOW_QUERY | NULL | NULL | NULL | 1078 | 11179 | 42302 | NULL | NULL | | 3222051655 | SYS | htap | 134.90.8.212:38844 | SYS | Sleep | | 969 | SLEEP | NULL | 134.90.8.216 | 2882 | 2881 | 9681060055801331737 | NULL | 134.90.8.212 | % | 0 | 0 | NULL | NULL | | 969 | 0 | 0 | | | | 1 | 322 | 1002 | 1002 | 1 | 10 | SAMPLE_AND_SLOW_QUERY | NULL | NULL | NULL | 638 | 8569 | 59424 | NULL | NULL | +------------+------+--------+--------------------+------+---------+--------+------+-------+------+--------------+----------+----------+---------------------+---------------+----------------+-----------+----------+-----------+------------+----------+-------------+------------+-----------+------------+--------+--------+-------------+-----------+---------+-----------+---------------------+-------+-------------------+-----------------------+--------+--------+----------+----------+-----------+------------------+------------+--------------+ 3 rows in set (0.02 sec) |
通过__all_virtual_processlist,我们可以看到持有锁的会话,以及被锁的会话,可以看到第二个session的retry_info有6005获取锁失败的报错,即被阻塞的会话。第一个active是持有锁,阻塞别人的会话,只要第一个会话回滚或提交,释放锁,第二个会话就可以获取到锁。
由于此次测试环境会话较少,能够清晰地看到我们需要的信息,在生产环境中这需要耗费点时间。因此,下面介绍其他方式供参考。
(1)业务反馈有积压,程序卡住,但不确定卡在哪些表上了。
针对这个问题,可以通过以下命令sql查询持有锁的所有会话及表名,并与业务确认卡顿程序涉及的表以定位锁,如果涉及杀会话,需要确认杀掉持有锁的会话会不会影响业务。
MySQL [oceanbase]> select a.tenant_id,a.svr_ip,a.ls_id,a.table_id,c.table_id table_actual_id,c.table_name,a.tablet_id,a.session_id,a.ctx_create_time -> from __all_virtual_trans_lock_stat a -> left join __all_virtual_tablet_to_ls b on b.tablet_id=a.tablet_id -> left join __all_virtual_table c on b.table_id=c.table_id; +-----------+--------------+-------+----------+-----------------+------------+-----------+------------+----------------------------+ | tenant_id | svr_ip | ls_id | table_id | table_actual_id | table_name | tablet_id | session_id | ctx_create_time | +-----------+--------------+-------+----------+-----------------+------------+-----------+------------+----------------------------+ | 1002 | 169.78.3.112 | 1001 | 0 | 500008 | ZRY2 | 200010 | 3221643380 | 2024-04-12 10:32:52.880125 | +-----------+--------------+-------+----------+-----------------+------------+-----------+------------+----------------------------+ 1 row in set (0.09 sec) |
(2)业务反馈涉及某个表的程序卡住
针对这个问题,我们只需要查询该表是否有锁,查询语句如下,该sql是根据表名查询持有锁的会话。
MySQL [oceanbase]> select * from (select a.tenant_id,a.svr_ip,a.ls_id,a.table_id,c.table_id table_actual_id,c.table_name,a.tablet_id,a.session_id,a.ctx_create_time -> from __all_virtual_trans_lock_stat a -> left join __all_virtual_tablet_to_ls b on b.tablet_id=a.tablet_id -> left join __all_virtual_table c on b.table_id=c.table_id) t where t.table_name like '%ZRY2%'; +-----------+--------------+-------+----------+-----------------+------------+-----------+------------+----------------------------+ | tenant_id | svr_ip | ls_id | table_id | table_actual_id | table_name | tablet_id | session_id | ctx_create_time | +-----------+--------------+-------+----------+-----------------+------------+-----------+------------+----------------------------+ | 1002 | 169.78.3.112 | 1001 | 0 | 500008 | ZRY2 | 200010 | 3221643380 | 2024-04-12 10:32:52.880125 | +-----------+----------- |