nshttpd/mikrotik-exporter

collector crash prevents other collectors from running

Opened this issue · 5 comments

Hi,
When one of the collectors crashes, the code returns directly from the collector for loop. Wouldn't it be better to log the error and continue with the remaining collectors?
I can provide a PR for this.

Nacef

I have also experienced this and have a 'quick and dirty' hack to fix it, but I think it's something worth discussing before anything is committed.

The main collector function will return the error as soon as a collector has failed:

if err != nil {
    return err
}

Moving the return outside of the loop, changing it to a continue, and returning the errors as a slice of errors would work; a rough sketch of that shape follows below.
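
To make the idea concrete, here is a small self-contained sketch of that shape. The collector struct, names, and error wiring are illustrative assumptions, not the exporter's actual code:

package main

import (
    "errors"
    "fmt"
    "log"
)

// collector is a stand-in for the exporter's per-feature collectors;
// the struct and names here are illustrative, not the repo's real types.
type collector struct {
    name    string
    collect func() error
}

// collectAll logs each failure and continues with the remaining
// collectors instead of returning on the first error, then reports
// the accumulated errors at the end.
func collectAll(collectors []collector) error {
    var errs []error
    for _, c := range collectors {
        if err := c.collect(); err != nil {
            log.Printf("ERROR: %s collector failed: %v", c.name, err)
            errs = append(errs, err)
            continue
        }
        log.Printf("OK: %s collector succeeded", c.name)
    }
    if len(errs) > 0 {
        return fmt.Errorf("%d collector(s) failed: %w", len(errs), errors.Join(errs...))
    }
    return nil
}

func main() {
    cols := []collector{
        {name: "interface", collect: func() error { return nil }},
        {name: "ipv6", collect: func() error { return errors.New("connection refused") }},
        {name: "dhcp", collect: func() error { return nil }},
    }
    if err := collectAll(cols); err != nil {
        log.Printf("scrape finished with errors: %v", err)
    }
}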

However, this is the next hurdle we run into: do we mark the overall scrape as successful or failed?

if err != nil {
    log.Errorf("ERROR: %s collector failed after %fs: %s", d.Name, duration.Seconds(), err)
    success = 0
} else {
    log.Debugf("OK: %s collector succeeded after %fs.", d.Name, duration.Seconds())
    success = 1
}
ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, duration.Seconds(), d.Name)
ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, d.Name)

My proposal here would be either:

  • Split success out into a per-collector metric. This allows each collector's scrape to be identified as successful or not, and the user can then raise an alert (or do whatever else they like) for that collector; see the sketch below.
  • Split the collector metrics into /metrics/collectorName and return an HTTP 500 when a collector is failing.

Both approaches are discussed in the Prometheus docs.
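
For the first option, here is a hedged sketch of what a per-collector success metric could look like with client_golang. The metric name, label set, sub-collector wiring, and port are assumptions for illustration, not the exporter's actual code:

package main

import (
    "errors"
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// collectorSuccessDesc is a hypothetical per-collector success gauge.
var collectorSuccessDesc = prometheus.NewDesc(
    "mikrotik_scrape_collector_success",
    "Whether an individual collector succeeded (1) or failed (0).",
    []string{"collector"},
    nil,
)

type deviceCollector struct {
    // subCollectors maps a collector name to its scrape function
    // (illustrative stand-in for the exporter's real collectors).
    subCollectors map[string]func(ch chan<- prometheus.Metric) error
}

func (d *deviceCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- collectorSuccessDesc
}

func (d *deviceCollector) Collect(ch chan<- prometheus.Metric) {
    for name, collect := range d.subCollectors {
        success := 1.0
        if err := collect(ch); err != nil {
            log.Printf("ERROR: %s collector failed: %v", name, err)
            success = 0
        }
        // One success gauge per sub-collector, so alerts can target
        // exactly the collector that is failing.
        ch <- prometheus.MustNewConstMetric(collectorSuccessDesc, prometheus.GaugeValue, success, name)
    }
}

func main() {
    dc := &deviceCollector{subCollectors: map[string]func(chan<- prometheus.Metric) error{
        "interface": func(chan<- prometheus.Metric) error { return nil },
        "ipv6":      func(chan<- prometheus.Metric) error { return errors.New("not supported") },
    }}
    prometheus.MustRegister(dc)
    http.Handle("/metrics", promhttp.Handler())
    // Port chosen only as an example.
    log.Fatal(http.ListenAndServe(":9436", nil))
}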

My personal preference would be the first, but before anyone goes and writes a solution I thought it would be useful to have the discussion and for @nshttpd to provide their opinion.

Hello guys!
I have the same issue, and fixing it with

if err != nil {
    return err
}

if err != nil {
    log.Errorf("ERROR: %s collector failed after %fs: %s", d.Name, duration.Seconds(), err)
    success = 0
} else {
    log.Debugf("OK: %s collector succeeded after %fs.", d.Name, duration.Seconds())
    success = 1
}
ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, duration.Seconds(), d.Name)
ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, d.Name)

did not help me.

These are not fixes; this is a discussion on how the issue could be resolved. I've not touched the code for a while, so I'd have to look back at how to do the fix, but based on my comments it would mean that success = 0 would need to be set to 1 for the IPv6 collector, or the IPv6 code would need to be commented out / made controllable via a flag.
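
As a generic illustration of the flag idea, here is a minimal sketch using the standard flag package. The flag name and collector wiring below are assumptions, not options the exporter actually exposes:

package main

import (
    "flag"
    "log"
)

// withIPv6 is a hypothetical flag; the exporter's real configuration
// options may differ.
var withIPv6 = flag.Bool("with-ipv6", false, "enable the IPv6 collector")

func main() {
    flag.Parse()

    collectors := []string{"interface", "resource"}
    if *withIPv6 {
        // Only enable the IPv6 collector when explicitly requested,
        // so devices without IPv6 support don't fail the scrape.
        collectors = append(collectors, "ipv6")
    }
    log.Printf("enabled collectors: %v", collectors)
}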

I changed success = 0 to success = 1 and this did not help. I still have the same errors.

if err != nil {
    log.Errorf("ERROR: %s collector failed after %fs: %s", d.Name, duration.Seconds(), err)
    success = 0
} else {
    log.Debugf("OK: %s collector succeeded after %fs.", d.Name, duration.Seconds())
    success = 1
}
ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, duration.Seconds(), d.Name)
ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, d.Name)

This still affects almost all of my devices, since I have the same collectors enabled for everything.