系统任务执行状态自监控能力支持
normal-wls opened this issue · 3 comments
normal-wls commented
- 梳理需要监控的指标
- 指标数据获取方案(获取方案,性能评估)
- 开发方案(开发结构设计、sops/engine)
- 产品前端支持(需要先沟通)
lTimej commented
监控指标:
1.执行中失败的任务
2.执行中的任务
3.schedule轮训次数较多的数据
lTimej commented
执行失败任务sql
EXPLAIN SELECT
tt.id,cp.name, pp.name
FROM
`taskflow3_taskflowinstance` AS tt,
`core_project` AS cp,
`pipeline_pipelineinstance` AS pp,
`eri_state` AS es
WHERE
pp.instance_id = es.root_id
AND tt.pipeline_instance_id=pp.id
AND tt.project_id=cp.id
AND pp.is_deleted = 0
AND pp.is_expired = 0
AND pp.is_finished = 0
AND pp.is_revoked = 0
AND pp.is_started = 1
AND es.NAME = "FAILED"
ORDER BY
pp.id DESC
执行失败任务ORM
sql = """
EXPLAIN SELECT
tt.id,cp.name, pp.name
FROM
`taskflow3_taskflowinstance` AS tt,
`core_project` AS cp,
`pipeline_pipelineinstance` AS pp,
`eri_state` AS es
WHERE
pp.instance_id = es.root_id
AND tt.pipeline_instance_id=pp.id
AND tt.project_id=cp.id
AND pp.is_deleted = 0
AND pp.is_expired = 0
AND pp.is_finished = 0
AND pp.is_revoked = 0
AND pp.is_started = 1
AND es.NAME = "FAILED"
ORDER BY
pp.id DESC
"""
with connection.cursor() as cursor:
cursor.execute(sql)
results = cursor.fetchall()
print(results)
执行中任务sql
explain SELECT
es.root_id,
COUNT( es.root_id ) AS total_count,
SUM( CASE WHEN es.NAME = 'Failed' THEN 1 ELSE 0 END ) AS failed_count,
SUM( CASE WHEN es.NAME = 'FINISHED' THEN 1 ELSE 0 END ) AS finished_count
FROM
pipeline_pipelineinstance AS pp
INNER JOIN eri_state AS es ON pp.instance_id = es.root_id
WHERE
pp.is_started = 1
AND pp.is_finished = 0
AND pp.is_revoked = 0
AND pp.is_expired = 0
AND is_deleted = 0
GROUP BY
es.root_id
HAVING
failed_count = 0
AND finished_count != total_count
ORDER BY
pp.id
limit 1000
执行中任务ORM
sql = """
SELECT
es.root_id,
COUNT( es.root_id ) AS total_count,
SUM( CASE WHEN es.NAME = 'Failed' THEN 1 ELSE 0 END ) AS failed_count,
SUM( CASE WHEN es.NAME = 'FINISHED' THEN 1 ELSE 0 END ) AS finished_count
FROM
pipeline_pipelineinstance AS pp
INNER JOIN eri_state AS es ON pp.instance_id = es.root_id
WHERE
pp.is_started = 1
AND pp.is_finished = 0
AND pp.is_revoked = 0
AND pp.is_expired = 0
AND is_deleted = 0
GROUP BY
es.root_id
HAVING
failed_count = 0
AND finished_count != total_count
ORDER BY
pp.id
limit 1000
"""
with connection.cursor() as cursor:
cursor.execute(sql)
results = cursor.fetchall()
print(results)
schedule轮训次数
EXPLAIN SELECT
pp.creator,
pp.id,
esc.schedule_times
FROM
eri_schedule AS esc,
eri_state AS es,
pipeline_pipelineinstance AS pp
WHERE
esc.node_id = es.node_id
AND es.root_id = pp.instance_id
AND esc.scheduling = 0
ORDER BY
esc.schedule_times DESC
LIMIT 1000;
sql = """
EXPLAIN SELECT
pp.creator,
pp.id,
esc.schedule_times
FROM
eri_schedule AS esc,
eri_state AS es,
pipeline_pipelineinstance AS pp
WHERE
esc.node_id = es.node_id
AND es.root_id = pp.instance_id
AND esc.scheduling = 0
ORDER BY
esc.schedule_times DESC
LIMIT 1000;
"""
with connection.cursor() as cursor:
cursor.execute(sql)
results = cursor.fetchall()
print(results)
normal-wls commented
补充:
- sql 性能优化:可以考虑拆分查询,减少联表
- mq 数据获取:通过 client 获取 mq 数据,并提供查询接口
- 聚合数据接口:某段时间(1天/3天)正在执行/失败任务数聚合