lucky-sideburn/kubeinvaders

V1.9.6 sends invalid prometheus metrics

AlexPSplunk opened this issue · 10 comments

In v1.9.6, if you manually start a chaos job, the /metrics endpoint reports invalid Prometheus metrics:
chaos_jobs_status:porpoise:cpu-attack-exp:cpu-attack-job Created
chaos_jobs_status:porpoise:cpu-attack-exp:cpu-attack-job-kzflg Succeeded

The Prometheus format requires the sample value to be a float, not a string. Sending these status messages on the /metrics endpoint causes some Prometheus scrapers to fail.

Dear @AlexPSplunk

Thank you for the issue. I am thinking of making this change:

If the job is created, the value is 0
scripts/programming_mode/start.py: r.set(f"chaos_jobs_status:{codename}:{exp['name']}:{exp['job']}", "Created")

If the phase is running or succeeded, the value is 1
scripts/metrics_loop/start.py: r.set(f"chaos_jobs_status:{codename}:{exp_name}:{job_name}", pod.status.phase)

For failed jobs the value will be -1

What do you think? Is it OK if I use integers with the values above?

Thank you,
Eugenio
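
A minimal sketch of the mapping proposed above, for illustration only. The names STATUS_VALUES and status_to_value are hypothetical and are not taken from the kubeinvaders code. The thread does not say what other phases (for example Pending) should map to, so this sketch rejects them instead of guessing.

```python
# Hypothetical helper: names and error handling are assumptions, not project code.
STATUS_VALUES = {
    "Created": 0.0,    # job was created
    "Running": 1.0,    # pod phase is running
    "Succeeded": 1.0,  # pod phase is succeeded
    "Failed": -1.0,    # job failed
}


def status_to_value(status: str) -> float:
    """Map a job or pod status string to the numeric value proposed in the thread."""
    try:
        return STATUS_VALUES[status]
    except KeyError:
        # The thread defines no value for other statuses, so fail loudly.
        raise ValueError(f"no numeric value defined for status {status!r}")
```

With something like this, the two r.set(...) calls quoted above could store status_to_value("Created") and status_to_value(pod.status.phase) instead of the raw strings.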

Hi @AlexPSplunk

I have fixed it with this commit

You can pull version v1.9.7 again or use latest.

Perfect solution - thanks for the quick response.

@AlexPSplunk thank you! I'm glad the tool is used at Splunk! We are a small Italian startup (https://platformengineering.it/ - https://devopstribe.it/). Maybe one day, if we grow, we could collaborate with Splunk :D

I should have done more testing before replying :(. The metric name itself is also invalid. Prometheus metric names cannot contain a hyphen. See this documentation for the acceptable metric name format.
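
A rough sketch of how the name could be checked and cleaned up, assuming the classic Prometheus metric name pattern [a-zA-Z_:][a-zA-Z0-9_:]*. The function names are hypothetical, and replacing invalid characters with underscores is just one possible choice.

```python
import re

# Classic Prometheus metric name pattern: letters, digits, underscores and
# colons, with no digit in the first position.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the classic pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def sanitize_metric_name(name: str) -> str:
    """Replace every disallowed character (such as '-') with an underscore."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # A leading digit is not allowed, so prefix it with an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


name = "chaos_jobs_status:porpoise:cpu-attack-exp:cpu-attack-job"
print(is_valid_metric_name(name))   # False, because of the hyphens
print(sanitize_metric_name(name))   # chaos_jobs_status:porpoise:cpu_attack_exp:cpu_attack_job
```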

Dear @AlexPSplunk

No problem :) It seems OK now. You can pull version v1.9.7 again or use latest.

$ eugenio@DESKTOP-BUVN7L3 (~) > curl http://localhost:8080/metrics
chaos_node_jobs_total 10
current_chaos_job_pod 0
chaos_jobs_status:cockroach:mem_attack_exp:mem_attack_job_htnmd 1.0
chaos_jobs_status:cockroach:mem_attack_exp:mem_attack_job_zkuof 1.0
chaos_jobs_status:cockroach:cpu_attack_exp:cpu_attack_job_nrrxl 1.0
chaos_jobs_status:cockroach:mem_attack_exp:mem_attack_job_ettpl 1.0
chaos_jobs_status:mendelevium:mem_attack_exp:mem_attack_job_xowxi 1.0
chaos_jobs_status:mendelevium:cpu_attack_exp:cpu_attack_job_psagx 1.0
chaos_jobs_status:cockroach:cpu_attack_exp:cpu_attack_job_ecjts 1.0
chaos_jobs_status:mendelevium:mem_attack_exp:mem_attack_job_rnice 1.0
chaos_jobs_status:mendelevium:cpu_attack_exp:cpu_attack_job_ajpqm 1.0
chaos_jobs_status:mendelevium:cpu_attack_exp:cpu_attack_job_ksmpm 1.0
chaos_jobs_status:mendelevium:mem_attack_exp:mem_attack_job_ssmai 1.0
chaos_jobs_status:mendelevium:cpu_attack_exp:cpu_attack_job_oghgc 1.0
chaos_jobs_status:mendelevium:mem_attack_exp:mem_attack_job_nndoq 1.0
chaos_jobs_status:mendelevium:mem_attack_exp:mem_attack_job_hsfgi 1.0
chaos_jobs_status:mendelevium:cpu_attack_exp:cpu_attack_job_bnkap 1.0
chaos_jobs_status:cockroach:cpu_attack_exp:cpu_attack_job_nyqct 1.0
chaos_jobs_status:cockroach:cpu_attack_exp:cpu_attack_job_vezuf 1.0
chaos_jobs_status:cockroach:mem_attack_exp:mem_attack_job_livba 1.0
chaos_jobs_status:cockroach:mem_attack_exp:mem_attack_job_bekub 1.0
chaos_jobs_status:cockroach:cpu_attack_exp:cpu_attack_job_niaki 1.0

Confirming that it reads correctly now.

You might consider using labels instead of including the codename, experiment, and job in the metric name. That way you can use the actual name of the job instead of converting dashes to underscores.
E.g.:
chaos_jobs_status{codename="mole",experiment="cpu-attack-exp", job="cpu-attack-job-aegbv"} 1
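
As a sketch of what that could look like when the line is built by hand: the helper names below are hypothetical, and the escaping follows the text exposition format rules for label values (backslash, double quote and newline are escaped).

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: Mapping[str, str], value: float) -> str:
    """Render one sample line, with labels if any are given."""
    if not labels:
        return f"{name} {value}"
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{rendered}}} {value}"


line = format_sample(
    "chaos_jobs_status",
    {"codename": "mole", "experiment": "cpu-attack-exp", "job": "cpu-attack-job-aegbv"},
    1.0,
)
print(line)
# chaos_jobs_status{codename="mole",experiment="cpu-attack-exp",job="cpu-attack-job-aegbv"} 1.0
```

Label values may contain hyphens, so the original job names survive unchanged. Only the metric name and the label names are restricted.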

Thanks @AlexPSplunk

I will do it.

If you have any other feedback about the programming mode console, please let me know :)

One other thing you could do is provide types for the metrics. For example, you probably want the metrics ending in _total to be Counters - but since the metrics are untyped, everything reads as a Gauge.
See: Prometheus metric types
Probably the simplest way to do this is to use the Prometheus exposition format.
Examples here
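
A hand-rolled sketch of typed output in the text exposition format, assuming the endpoint keeps building its response as plain strings. The function name and the help texts are made up for illustration; the official Python client library (prometheus_client) can generate the same format and may be the simpler route.

```python
# Metric types accepted on a "# TYPE" line in the text exposition format.
VALID_TYPES = {"counter", "gauge", "histogram", "summary", "untyped"}


def render_family(name: str, metric_type: str, help_text: str, value: float) -> str:
    """Render HELP, TYPE and a single unlabeled sample for one metric."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unknown metric type {metric_type!r}")
    # HELP text must escape backslashes and newlines.
    escaped_help = help_text.replace("\\", "\\\\").replace("\n", "\\n")
    lines = [
        f"# HELP {name} {escaped_help}",
        f"# TYPE {name} {metric_type}",
        f"{name} {float(value)}",
    ]
    return "\n".join(lines) + "\n"


# Help strings below are placeholders, not the project's wording.
body = render_family("chaos_node_jobs_total", "counter", "Total chaos jobs run on nodes.", 10)
body += render_family("current_chaos_job_pod", "gauge", "Chaos job pods currently running.", 0)
print(body, end="")
```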

@AlexPSplunk OK, I will do it! Yesterday I pushed a fixed chaos programming console (with a code editor and some fixes). I think I will write the code for your suggestion when I come back from vacation.