Nomon/nomad-exporter

An error has occurred during metrics collection

Opened this issue · 6 comments

I tried running this exporter, but I am getting the following error:

An error has occurred during metrics collection:

4 error(s) occurred:
* collected metric nomad_allocation_cpu label:<name:"alloc" value:"infra/statsd-exporter.statsd-exporter[0]" > label:<name:"group" value:"statsd-exporter" > label:<name:"job" value:"infra/statsd-exporter" > gauge:<value:4.02303193877551 >  was collected before with the same name and label values
* collected metric nomad_allocation_cpu_throttle label:<name:"alloc" value:"infra/statsd-exporter.statsd-exporter[0]" > label:<name:"group" value:"statsd-exporter" > label:<name:"job" value:"infra/statsd-exporter" > gauge:<value:0 >  was collected before with the same name and label values
* collected metric nomad_allocation_memory label:<name:"alloc" value:"infra/statsd-exporter.statsd-exporter[0]" > label:<name:"group" value:"statsd-exporter" > label:<name:"job" value:"infra/statsd-exporter" > gauge:<value:2.2781952e+07 >  was collected before with the same name and label values
* collected metric nomad_allocation_memory_limit label:<name:"alloc" value:"infra/statsd-exporter.statsd-exporter[0]" > label:<name:"group" value:"statsd-exporter" > label:<name:"job" value:"infra/statsd-exporter" > gauge:<value:256 >  was collected before with the same name and label values

I believe this happens when there are several older allocations:

nomad status infra/statsd-exporter
ID          = infra/statsd-exporter
Name        = infra/statsd-exporter
Type        = service
Priority    = 50
Datacenters = ovh
Status      = running
Periodic    = false

Summary
Task Group       Queued  Starting  Running  Failed  Complete  Lost
statsd-exporter  0       0         1        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group       Desired  Status    Created At
57ec626a  60bc583d  375d5aaf  statsd-exporter  run      running   09/08/16 10:13:30 UTC
47ce6dd6  2e863db7  375d5aaf  statsd-exporter  stop     complete  09/08/16 09:28:16 UTC
16dc534e  5913852f  22defaf9  statsd-exporter  stop     complete  09/05/16 11:55:17 UTC

After manually triggering garbage collection, the old allocations were gone and the exporter worked.

curl -X PUT http://localhost:4646/v1/system/gc

nomad status infra/statsd-exporter
ID          = infra/statsd-exporter
Name        = infra/statsd-exporter
Type        = service
Priority    = 50
Datacenters = ovh
Status      = running
Periodic    = false

Summary
Task Group       Queued  Starting  Running  Failed  Complete  Lost
statsd-exporter  0       0         1        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group       Desired  Status   Created At
57ec626a  60bc583d  375d5aaf  statsd-exporter  run      running  09/08/16 10:13:30 UTC



Nomon commented

Might need to add the allocation id as a label to the allocations or alternatively only collect from running allocations to ensure the uniqueness of the name + labels (job_name,group_name,alloc_name[alloc_index]). I will take a closer look later today.
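To make the two options above concrete, here is a minimal sketch in plain Python (used only for illustration, it is not the exporter's code). The allocation IDs and statuses come from the `nomad status` output in this issue; the dict layout, the `alloc_id` label name, and the assumption that all three allocations share the name `infra/statsd-exporter.statsd-exporter[0]` are mine.

```python
# Allocations from the `nomad status` output above (before garbage collection).
# Assumption: the old allocations carry the same alloc name as the running one.
NAME = "infra/statsd-exporter.statsd-exporter[0]"
allocations = [
    {"id": "57ec626a", "job": "infra/statsd-exporter",
     "group": "statsd-exporter", "name": NAME, "status": "running"},
    {"id": "47ce6dd6", "job": "infra/statsd-exporter",
     "group": "statsd-exporter", "name": NAME, "status": "complete"},
    {"id": "16dc534e", "job": "infra/statsd-exporter",
     "group": "statsd-exporter", "name": NAME, "status": "complete"},
]


def label_sets(allocs, with_id=False):
    """Build one label tuple per allocation, the way a collector might."""
    sets = []
    for a in allocs:
        labels = (("alloc", a["name"]), ("group", a["group"]), ("job", a["job"]))
        if with_id:
            # Hypothetical extra label that makes every series unique.
            labels += (("alloc_id", a["id"]),)
        sets.append(labels)
    return sets


def has_duplicates(sets):
    """True if any two allocations would produce identical label values."""
    return len(set(sets)) != len(sets)


current = label_sets(allocations)
running_only = label_sets([a for a in allocations if a["status"] == "running"])
with_alloc_id = label_sets(allocations, with_id=True)

print(has_duplicates(current))        # True: the collision reported above
print(has_duplicates(running_only))   # False
print(has_duplicates(with_alloc_id))  # False
```

Either change removes the collision. Filtering keeps the number of series small, while an ID label keeps old allocations visible at the cost of a new series for every allocation.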

I think only running allocations are of interest, as they are the only ones with interesting metrics.

I don't know if the alloc index is necessary; isn't the alloc name already unique?

Nomon commented

The alloc index is included in the alloc name: if a group has count = 10, then the allocs are named task_name[alloc_index], with alloc_index running from 0 to 9.

Sorry my mistake, I meant allocation ID, not index.

Would it be of interest to have allocations broken down by status? Right now nomad_allocations shows all allocations.

I don't know if it would be interesting to have
nomad_allocations{status="running|completed"}
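As a rough illustration of what such a per-status gauge could look like, the sketch below counts the three allocations from the first `nomad status` output and prints one sample line per status. It is plain Python rather than exporter code, and the function name is hypothetical. Note that Nomad reports the status as `complete`, so that is the label value used here.

```python
from collections import Counter

# Statuses taken from the Allocations table before garbage collection.
statuses = ["running", "complete", "complete"]


def allocations_by_status(statuses):
    """Return one exposition-style sample line per distinct status."""
    counts = Counter(statuses)
    return [
        f'nomad_allocations{{status="{status}"}} {count}'
        for status, count in sorted(counts.items())
    ]


lines = allocations_by_status(statuses)
for line in lines:
    print(line)
# nomad_allocations{status="complete"} 2
# nomad_allocations{status="running"} 1
```

Summing over the `status` label would still give the total that nomad_allocations reports today.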

Nomon commented

Might be useful; it would allow monitoring queued counts etc. The same could perhaps be extended to nodes, and we could add evaluations by status as well in the future. We should go through the information the built-in stats providers (statsite, statsd, datadog etc.) expose and try to emulate those to some extent.

Hi,
I get the same error with nomad_serf_lan_member_status:

An error has occurred during metrics collection:

collected metric nomad_serf_lan_member_status label:<name:"class" value:"" > label:<name:"datacenter" value:"staging" > label:<name:"drain" value:"false" > label:<name:"node" value:"<cluster_member_hostname_here>" > gauge:<value:0 >  was collected before with the same name and label values

As you can see below, there are two nodes with the same name (app1.test.local).
I guess one of the Nomad agents was lost and a new one was then started.

~ $ nomad node-status
ID        DC       Name              Class   Drain  Status
13c89393  staging  app1.test.local   <none>  false  ready
bcb94e93  staging  app1.test.local   <none>  false  down
98f7b583  staging  app3.test.local   <none>  false  ready
869ba8a7  staging  app7.test.local   <none>  false  ready
a5bac338  staging  app9.test.local   <none>  false  ready
f5cc2390  staging  app5.test.local   <none>  false  ready
28ed1f83  staging  app13.test.local  <none>  false  ready
0e71ac4e  staging  app11.test.local  <none>  false  ready

Have you thought about adding a node_id label?
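For illustration, this small Python sketch groups the nodes from the `nomad node-status` output by the labels the metric carries and reports which label sets are shared by more than one node. The tuple layout and the `colliding` helper are assumptions made for this example; it is not the exporter's code.

```python
from collections import defaultdict

# (id, datacenter, name, status) rows from the `nomad node-status` output above.
nodes = [
    ("13c89393", "staging", "app1.test.local", "ready"),
    ("bcb94e93", "staging", "app1.test.local", "down"),
    ("98f7b583", "staging", "app3.test.local", "ready"),
    ("869ba8a7", "staging", "app7.test.local", "ready"),
    ("a5bac338", "staging", "app9.test.local", "ready"),
    ("f5cc2390", "staging", "app5.test.local", "ready"),
    ("28ed1f83", "staging", "app13.test.local", "ready"),
    ("0e71ac4e", "staging", "app11.test.local", "ready"),
]


def colliding(nodes, key):
    """Map each label set shared by several nodes to the IDs of those nodes."""
    groups = defaultdict(list)
    for node in nodes:
        groups[key(node)].append(node[0])
    return {labels: ids for labels, ids in groups.items() if len(ids) > 1}


# Labels without the node ID: datacenter and node name.
by_name = colliding(nodes, key=lambda n: (n[1], n[2]))
# Labels including a hypothetical node_id label.
by_id = colliding(nodes, key=lambda n: (n[0], n[1], n[2]))

print(by_name)  # {('staging', 'app1.test.local'): ['13c89393', 'bcb94e93']}
print(by_id)    # {}
```

With the ID included, the lost agent and its replacement no longer share a label set, so both could be reported side by side.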