Crash when connection reset by peer
lucymhdavies opened this issue · 12 comments
This Rancher Exporter works fine for us most of the time, but will occasionally crash, and Rancher will restart it, after which it is fine.
In this example, 10.1.2.3 is the IP address of the container on Rancher's Docker network, and 10.3.2.1 is an F5 Virtual IP which points at our two Rancher Server instances.
Ideally, the app would handle this gracefully, e.g. by retrying after a few seconds.
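Something along these lines is what we had in mind. This is just a minimal sketch with made-up names, not the exporter's actual code:

// Minimal sketch with made-up names; not the exporter's actual code.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

// fetchWithRetry retries transient failures (e.g. "connection reset by peer")
// a few times with a short pause, instead of letting the error bubble up into
// a panic.
func fetchWithRetry(client *http.Client, url string, attempts int) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			time.Sleep(5 * time.Second) // retry after a few seconds
			continue
		}
		body, err := ioutil.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			time.Sleep(5 * time.Second)
			continue
		}
		return body, nil
	}
	return nil, fmt.Errorf("giving up after %d attempts: %v", attempts, lastErr)
}

func main() {
	body, err := fetchWithRetry(http.DefaultClient, "https://rancher.example.io/v1/services/", 3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(body))
}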
Some example logs:
time="2017-07-26T11:54:40Z" level=info msg="Scraping: https://rancher.example.io/v1/environments/"
time="2017-07-26T11:54:40Z" level=info msg="Metrics successfully processed for stacks"
time="2017-07-26T11:54:40Z" level=info msg="Scraping: https://rancher.example.io/v1/services/"
time="2017-07-26T11:54:45Z" level=error msg="Error Collecting JSON from API: Get https://rancher.example.io/v1/services/: read tcp 10.1.2.3:40962->10.3.2.1:443: read: connection reset by peer"
time="2017-07-26T11:54:45Z" level=info msg="Scraping: https://rancher.example.io/v1/environments/"
panic: Get https://rancher.example.io/v1/services/: read tcp 10.1.2.3:40962->10.3.2.1:443: read: connection reset by peer
goroutine 52638 [running]:
panic(0x8574a0, 0xc820312390)
/usr/lib/go/src/runtime/panic.go:481 +0x3e6
main.getJSON(0xc820398930, 0x30, 0xc820014042, 0x14, 0xc820012012, 0x28, 0x7530c0, 0xc8200243d8, 0x0, 0x0)
/go/src/github.com/infinityworksltd/prometheus-rancher-exporter/gather.go:144 +0x543
main.(*Exporter).gatherData(0xc82005a720, 0xc82001208b, 0x26, 0xc820014042, 0x14, 0xc820012012, 0x28, 0x8d1f50, 0x8, 0xc8202028a0, ...)
/go/src/github.com/infinityworksltd/prometheus-rancher-exporter/gather.go:117 +0x17f
main.(*Exporter).Collect(0xc82005a720, 0xc8202028a0)
/go/src/github.com/infinityworksltd/prometheus-rancher-exporter/prometheus.go:36 +0x18b
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc82033d2d0, 0xc8202028a0, 0x7f8f07e22950, 0xc82005a720)
/go/src/github.com/prometheus/client_golang/prometheus/registry.go:382 +0x58
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/go/src/github.com/prometheus/client_golang/prometheus/registry.go:383 +0x360
We're running from infinityworks/prometheus-rancher-exporter:latest, which was 094f6595dd62 when we saw this error.
Thanks for the detailed bug report; I've not personally done much testing against HA instances.
I've had a look at the code; I suspect the issue is that the error from http.NewRequest in gather.go isn't handled. I'll push up a fix shortly. Would you be okay to test it on your setup?
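Roughly, the change should look something like this. The signature and auth details here are from memory and may not match gather.go exactly:

// Sketch only; the real getJSON signature and auth handling in gather.go may differ.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func getJSON(url, accessKey, secretKey string, target interface{}) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return fmt.Errorf("error creating request: %v", err)
	}
	req.SetBasicAuth(accessKey, secretKey)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Return the error instead of carrying on: resp is nil whenever
		// err is non-nil, so any further use of it would panic.
		return fmt.Errorf("error collecting JSON from API: %v", err)
	}
	defer resp.Body.Close()

	return json.NewDecoder(resp.Body).Decode(target)
}

func main() {
	var out map[string]interface{}
	if err := getJSON("https://rancher.example.io/v1/services/", "access", "secret", &out); err != nil {
		fmt.Println(err)
	}
}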
Yeah, should be fine.
It's very intermittent though, so I've not yet figured out a way to force the error to happen.
Have you fixed it yet @Rucknar?
@lucymhdavies fix pushed to :latest & v0.22.64.
Fix looks like it should hopefully work, but...
exec: "rancher_exporter": executable file not found in $PATH
Something's up with these images.
Thanks Ed. Was going to PR your last two commits. Should work better now.
Awesome, sorry for that. Should make a mental note not to commit code before coffee.
infinityworks/prometheus-rancher-exporter:v0.22.88 and infinityworks/prometheus-rancher-exporter:latest should now be good.
Well....
time="2017-11-03T10:18:57Z" level=info msg="Starting Prometheus Exporter for Rancher"
time="2017-11-03T10:18:57Z" level=info msg="Runtime Configuration in-use: URL of Rancher Server: https://rancher.example.io/v1 AccessKey: AFDEE5EF80A5C326E4D0System Services hidden: true"
time="2017-11-03T10:18:57Z" level=info msg="Starting Server on port :9173 and path /metrics"
time="2017-11-03T10:18:57Z" level=info msg="Scraping: https://rancher.example.io/v1/environments/"
time="2017-11-03T10:18:57Z" level=error msg="Error Collecting JSON from API: Get https://rancher.example.io/v1/environments/: x509: failed to load system roots and no roots provided"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x6ac573]

goroutine 20 [running]:
main.getJSON(0xc420018700, 0x39, 0xc42001a042, 0x14, 0xc420018012, 0x28, 0x6d37c0, 0xc4201840f8, 0x921f10, 0xc420073450)
	/go/src/github.com/infinityworks/prometheus-rancher-exporter/gather.go:154 +0x463
main.(*Exporter).gatherData(0xc42005c6c0, 0xc42001808b, 0x2b, 0xc42001a042, 0x14, 0xc420018012, 0x28, 0x767087, 0x6, 0xc42005c960, ...)
	/go/src/github.com/infinityworks/prometheus-rancher-exporter/gather.go:120 +0x155
main.(*Exporter).Collect(0xc42005c6c0, 0xc42005c960)
	/go/src/github.com/infinityworks/prometheus-rancher-exporter/prometheus.go:35 +0x1fa
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc420024980, 0xc42005c960, 0x8eed80, 0xc42005c6c0)
	/go/src/github.com/prometheus/client_golang/prometheus/registry.go:383 +0x61
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/go/src/github.com/prometheus/client_golang/prometheus/registry.go:381 +0x2e1
This bit makes me think this is due to the Dockerfile changes:
time="2017-11-03T10:18:57Z" level=error msg="Error Collecting JSON from API: Get https://rancher.example.io/v1/environments/: x509: failed to load system roots and no roots provided"
i.e. no root certificates in alpine:latest, perhaps.
I built an image from 777318f, the last commit before the Dockerfile changes, and it works fine.
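If that is the cause, the usual fix on Alpine is to install the CA bundle in the image. A rough sketch only; the binary path is a guess and this is not necessarily what the repo's actual Dockerfile (or the eventual PR) looks like:

# Rough sketch of the usual fix for "x509: failed to load system roots" on
# Alpine-based images; paths are guesses, not the repo's actual Dockerfile.
FROM alpine:latest
RUN apk add --no-cache ca-certificates
COPY rancher_exporter /usr/local/bin/rancher_exporter
ENTRYPOINT ["/usr/local/bin/rancher_exporter"]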
I have a fix. Pull Request incoming.
#36 should fix that issue
Built/pushed to the same two tags as before.
Ta. Seems to be working now.
I'll leave a broken version (094f6595dd62) running in one site, and this fixed version running in the other, and hopefully we should be able to see that only one site is alerting us.
Thanks Lucy, might be worth me giving the codebase some TLC ahead of the 2.0 release.