TencentBlueKing/bk-sops

系统任务执行状态自监控能力支持

normal-wls opened this issue · 3 comments

  1. 梳理需要监控的指标
  2. 指标数据获取方案(获取方案,性能评估)
  3. 开发方案(开发结构设计、sops/engine)
  4. 产品前端支持(需要先沟通)
监控指标:
    1.执行中失败的任务
    2.执行中的任务
    3.schedule轮训次数较多的数据

执行失败任务sql

EXPLAIN SELECT
	tt.id,cp.name, pp.name
FROM
	`taskflow3_taskflowinstance` AS tt,
	`core_project` AS cp,
	`pipeline_pipelineinstance` AS pp,
	`eri_state` AS es
WHERE
		pp.instance_id = es.root_id 
		AND tt.pipeline_instance_id=pp.id
		AND tt.project_id=cp.id
		AND pp.is_deleted = 0 
		AND pp.is_expired = 0 
		AND pp.is_finished = 0 
		AND pp.is_revoked = 0 
		AND pp.is_started = 1 
		AND es.NAME = "FAILED" 
ORDER BY
pp.id DESC

执行失败任务ORM

        sql = """
            EXPLAIN SELECT
	tt.id,cp.name, pp.name
FROM
	`taskflow3_taskflowinstance` AS tt,
	`core_project` AS cp,
	`pipeline_pipelineinstance` AS pp,
	`eri_state` AS es
WHERE
		pp.instance_id = es.root_id 
		AND tt.pipeline_instance_id=pp.id
		AND tt.project_id=cp.id
		AND pp.is_deleted = 0 
		AND pp.is_expired = 0 
		AND pp.is_finished = 0 
		AND pp.is_revoked = 0 
		AND pp.is_started = 1 
		AND es.NAME = "FAILED" 
ORDER BY
pp.id DESC
        """
        with connection.cursor() as cursor:
            cursor.execute(sql)
            results = cursor.fetchall()
            print(results)

执行中任务sql

explain SELECT
	es.root_id,
	COUNT( es.root_id ) AS total_count,
	SUM( CASE WHEN es.NAME = 'Failed' THEN 1 ELSE 0 END ) AS failed_count,
	SUM( CASE WHEN es.NAME = 'FINISHED' THEN 1 ELSE 0 END ) AS finished_count 
FROM
	pipeline_pipelineinstance AS pp
	INNER JOIN eri_state AS es ON pp.instance_id = es.root_id 
WHERE
	pp.is_started = 1 
	AND pp.is_finished = 0 
	AND pp.is_revoked = 0 
	AND pp.is_expired = 0 
	AND is_deleted = 0 
GROUP BY
	es.root_id
HAVING
	failed_count = 0 
	AND finished_count != total_count 
ORDER BY
	pp.id
limit 1000

执行中任务ORM

sql = """
            SELECT
                es.root_id,
                COUNT( es.root_id ) AS total_count,
                SUM( CASE WHEN es.NAME = 'Failed' THEN 1 ELSE 0 END ) AS failed_count,
                SUM( CASE WHEN es.NAME = 'FINISHED' THEN 1 ELSE 0 END ) AS finished_count 
            FROM
                pipeline_pipelineinstance AS pp
                INNER JOIN eri_state AS es ON pp.instance_id = es.root_id 
            WHERE
                pp.is_started = 1 
                AND pp.is_finished = 0 
                AND pp.is_revoked = 0 
                AND pp.is_expired = 0 
                AND is_deleted = 0 
            GROUP BY
                es.root_id
            HAVING
                failed_count = 0 
                AND finished_count != total_count 
            ORDER BY
                pp.id
            limit 1000
        """
        with connection.cursor() as cursor:
            cursor.execute(sql)
            results = cursor.fetchall()
            print(results)

schedule轮训次数

EXPLAIN SELECT
                pp.creator,
                pp.id,
                esc.schedule_times 
            FROM
                eri_schedule AS esc,
                eri_state AS es,
                pipeline_pipelineinstance AS pp 
            WHERE
                esc.node_id = es.node_id 
                AND es.root_id = pp.instance_id 
                AND esc.scheduling = 0 
            ORDER BY
                esc.schedule_times DESC
            LIMIT 1000;
sql = """
            EXPLAIN SELECT
                pp.creator,
                pp.id,
                esc.schedule_times 
            FROM
                eri_schedule AS esc,
                eri_state AS es,
                pipeline_pipelineinstance AS pp 
            WHERE
                esc.node_id = es.node_id 
                AND es.root_id = pp.instance_id 
                AND esc.scheduling = 0 
            ORDER BY
                esc.schedule_times DESC
            LIMIT 1000;
        """
        with connection.cursor() as cursor:
            cursor.execute(sql)
            results = cursor.fetchall()
            print(results)

补充:

  1. sql 性能优化:可以考虑拆分查询,减少联表
  2. mq 数据获取:通过 client 获取 mq 数据,并提供查询接口
  3. 聚合数据接口:某段时间(1天/3天)正在执行/失败任务数聚合