oceanbase/obdiag

[Enhancement]: cpu.high 常见增加收集项目

Opened this issue · 0 comments

Description

obdiag gather scene run --scene=observer.cpu_high 采集的信息对分析ob集群 cpu占用高 问题 需要增加一些采集项目

1)TOP SQL信息 ,建议2个纬度,cpu占用和读物理盘行数。sql_audit视图有记录。
例如:
select sql_id,sum(execute_time),,sum(QUEUE_TIME), count(*), sum(MEMSTORE_READ_ROW_COUNT), sum(SSSTORE_READ_ROW_COUNT) from oceanbase.gv$ob_sql_audit where REQUEST_TIME between 1721113200000000 and 1721115000000000 group by sql_id order by 2 desc limit 10;

找到对应sql语句 获取最高占用的sql的执行计划和对应sql的GV$OB_SQL_PLAN信息。

再获取对应集群级和表级分区分布以及分区里数据量是否均匀(包括二级分区)

2)gv$ob_sql_audit 信息,需要保留采集时间的全信息。

3)QPS请求分布,可以使用下面这个sql
select t2.zone, t1.svr_ip, count(*) as QPS from oceanbase.gv$ob_sql_audit t1, oceanbase.__all_server t2 where t1.svr_ip = t2.svr_ip and t1.tenant_id = 1002 and IS_EXECUTOR_RPC = 0 and request_time > (time_to_usec(now()) - 1000000) and request_time < time_to_usec(now()) group by t1.svr_ip order by QPS;

根据上诉sql 获取前几个的QPS高的节点IP,查看QPS和计划类型

select plan_type, count(*) as QPS from oceanbase.gv$ob_sql_audit t1, oceanbase.__all_server t2 where t1.svr_ip = t2.svr_ip and t1.tenant_id = 1002 and IS_EXECUTOR_RPC = 0 and request_time > (time_to_usec(now()) - 1000000) and request_time < time_to_usec(now())
and t1.svr_ip='xx.xx.xx.xx'
group by plan_type order by 2 desc ;

4)检查占用高表的表结构 分区类型是否有函数分区方式,例如to_days(分区键) ,如果有 可能是路由不生效导致,并检查obproxy日志是否有分区解析失败信息。