Crash when connection reset by peer
lucymhdavies opened this issue · 12 comments
This Rancher Exporter works fine for us most of the time, but will occasionally crash, and Rancher will restart it, after which it is fine.
In this example, 10.1.2.3 is the IP address of the container on Rancher's Docker network, and 10.3.2.1 is an F5 Virtual IP which points at our two Rancher Server instances.
Ideally, the app would handle this gracefully, e.g. by retrying after a few seconds.
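Something along these lines is what we had in mind. This is just a minimal sketch with made-up names, not the exporter's actual code:

// Minimal sketch with made-up names; not the exporter's actual code.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

// fetchWithRetry retries transient failures (e.g. "connection reset by peer")
// a few times with a short pause, instead of letting the error bubble up into
// a panic.
func fetchWithRetry(client *http.Client, url string, attempts int) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			time.Sleep(5 * time.Second) // retry after a few seconds
			continue
		}
		body, err := ioutil.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			time.Sleep(5 * time.Second)
			continue
		}
		return body, nil
	}
	return nil, fmt.Errorf("giving up after %d attempts: %v", attempts, lastErr)
}

func main() {
	body, err := fetchWithRetry(http.DefaultClient, "https://rancher.example.io/v1/services/", 3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(body))
}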
Some example logs:
time="2017-07-26T11:54:40Z" level=info msg="Scraping: https://rancher.example.io/v1/environments/"
time="2017-07-26T11:54:40Z" level=info msg="Metrics successfully processed for stacks"
time="2017-07-26T11:54:40Z" level=info msg="Scraping: https://rancher.example.io/v1/services/"
time="2017-07-26T11:54:45Z" level=error msg="Error Collecting JSON from API: Get https://rancher.example.io/v1/services/: read tcp 10.1.2.3:40962->10.3.2.1:443: read: connection reset by peer"
time="2017-07-26T11:54:45Z" level=info msg="Scraping: https://rancher.example.io/v1/environments/"
panic: Get https://rancher.example.io/v1/services/: read tcp 10.1.2.3:40962->10.3.2.1:443: read: connection reset by peer
goroutine 52638 [running]:
panic(0x8574a0, 0xc820312390)
/usr/lib/go/src/runtime/panic.go:481 +0x3e6
main.getJSON(0xc820398930, 0x30, 0xc820014042, 0x14, 0xc820012012, 0x28, 0x7530c0, 0xc8200243d8, 0x0, 0x0)
/go/src/github.com/infinityworksltd/prometheus-rancher-exporter/gather.go:144 +0x543
main.(*Exporter).gatherData(0xc82005a720, 0xc82001208b, 0x26, 0xc820014042, 0x14, 0xc820012012, 0x28, 0x8d1f50, 0x8, 0xc8202028a0, ...)
/go/src/github.com/infinityworksltd/prometheus-rancher-exporter/gather.go:117 +0x17f
main.(*Exporter).Collect(0xc82005a720, 0xc8202028a0)
/go/src/github.com/infinityworksltd/prometheus-rancher-exporter/prometheus.go:36 +0x18b
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc82033d2d0, 0xc8202028a0, 0x7f8f07e22950, 0xc82005a720)
/go/src/github.com/prometheus/client_golang/prometheus/registry.go:382 +0x58
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/go/src/github.com/prometheus/client_golang/prometheus/registry.go:383 +0x360
We're running from infinityworks/prometheus-rancher-exporter:latest, which was 094f6595dd62 when we saw this error.
Thanks for the detailed bug report; I've not personally done much testing against HA instances.
I've had a look at the code; I suspect the issue is that the error from http.NewRequest in gather.go isn't handled. I'll push up a fix shortly. Would you be okay to test it on your setup?
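Roughly, the change should look something like this. The signature and auth details here are from memory and may not match gather.go exactly:

// Sketch only; the real getJSON signature and auth handling in gather.go may differ.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func getJSON(url, accessKey, secretKey string, target interface{}) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return fmt.Errorf("error creating request: %v", err)
	}
	req.SetBasicAuth(accessKey, secretKey)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Return the error instead of carrying on: resp is nil whenever
		// err is non-nil, so any further use of it would panic.
		return fmt.Errorf("error collecting JSON from API: %v", err)
	}
	defer resp.Body.Close()

	return json.NewDecoder(resp.Body).Decode(target)
}

func main() {
	var out map[string]interface{}
	if err := getJSON("https://rancher.example.io/v1/services/", "access", "secret", &out); err != nil {
		fmt.Println(err)
	}
}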
Yeah, should be fine.
It's very intermittent though, so I've not yet figured out a way to force the error to happen.
Have you fixed it yet @Rucknar?
@lucymhdavies fix pushed to :latest & v0.22.64.
Fix looks like it should hopefully work, but...
exec: "rancher_exporter": executable file not found in $PATH
Something's up with these images.
Thanks Ed. Was going to PR your last two commits. Should work better now.
Awesome, sorry for that. Should make a mental note not to commit code before coffee.
infinityworks/prometheus-rancher-exporter:v0.22.88 and infinityworks/prometheus-rancher-exporter:latest should now be good.
Well....
time="2017-11-03T10:18:57Z" level=info msg="Starting Prometheus Exporter for Rancher"
time="2017-11-03T10:18:57Z" level=info msg="Runtime Configuration in-use: URL of Rancher Server: https://rancher.example.io/v1 AccessKey: AFDEE5EF80A5C326E4D0System Services hidden: true"
time="2017-11-03T10:18:57Z" level=info msg="Starting Server on port :9173 and path /metrics"
time="2017-11-03T10:18:57Z" level=info msg="Scraping: https://rancher.example.io/v1/environments/"
time="2017-11-03T10:18:57Z" level=error msg="Error Collecting JSON from API: Get https://rancher.example.io/v1/environments/: x509: failed to load system roots and no roots provided"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x6ac573]

goroutine 20 [running]:
main.getJSON(0xc420018700, 0x39, 0xc42001a042, 0x14, 0xc420018012, 0x28, 0x6d37c0, 0xc4201840f8, 0x921f10, 0xc420073450)
	/go/src/github.com/infinityworks/prometheus-rancher-exporter/gather.go:154 +0x463
main.(*Exporter).gatherData(0xc42005c6c0, 0xc42001808b, 0x2b, 0xc42001a042, 0x14, 0xc420018012, 0x28, 0x767087, 0x6, 0xc42005c960, ...)
	/go/src/github.com/infinityworks/prometheus-rancher-exporter/gather.go:120 +0x155
main.(*Exporter).Collect(0xc42005c6c0, 0xc42005c960)
	/go/src/github.com/infinityworks/prometheus-rancher-exporter/prometheus.go:35 +0x1fa
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc420024980, 0xc42005c960, 0x8eed80, 0xc42005c6c0)
	/go/src/github.com/prometheus/client_golang/prometheus/registry.go:383 +0x61
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/go/src/github.com/prometheus/client_golang/prometheus/registry.go:381 +0x2e1
This bit makes me think this is due to the Dockerfile changes:
time="2017-11-03T10:18:57Z" level=error msg="Error Collecting JSON from API: Get https://rancher.example.io/v1/environments/: x509: failed to load system roots and no roots provided"
i.e. no root certificates in alpine:latest, perhaps.
I built an image from 777318f, the last commit before the Dockerfile changes, and it works fine.
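If that is the cause, the usual fix on Alpine is to install the CA bundle in the image. A rough sketch only; the binary path is a guess and this is not necessarily what the repo's actual Dockerfile (or the eventual PR) looks like:

# Rough sketch of the usual fix for "x509: failed to load system roots" on
# Alpine-based images; paths are guesses, not the repo's actual Dockerfile.
FROM alpine:latest
RUN apk add --no-cache ca-certificates
COPY rancher_exporter /usr/local/bin/rancher_exporter
ENTRYPOINT ["/usr/local/bin/rancher_exporter"]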
I have a fix. Pull Request incoming.
#36 should fix that issue
Built/pushed to the same two tags as before.
Ta. Seems to be working now.
I'll leave a broken version (094f6595dd62) running in one site, and this fixed version running in the other, and hopefully we should be able to see that only one site is alerting us.
Thanks Lucy, might be worth me giving the codebase some TLC ahead of the 2.0 release.