
Degradation of certificate stream after one week

Closed this issue · 5 comments

artgl commented

Just after certstream-server is started, the client retrieves up to 300 certs/sec. After one week of continuous operation, the client shows zero updates. This is what I see in the latest server logs:

`16:54:13.822 [info] Worker #PID<0.253.0> with url found 1 certificates [6978 -> 6979].

17:13:06.601 [info] Worker #PID<0.310.0> with url found 1 certificates [3460 -> 3461].
17:13:07.499 [info] Worker #PID<0.305.0> with url found 11 certificates [67275 -> 67286].
17:13:07.736 [info] Worker #PID<0.293.0> with url found 1 certificates [3703 -> 3704].
18:13:01.839 [info] Worker #PID<0.305.0> with url found 10 certificates [67286 -> 67296].
18:54:09.569 [info] Worker #PID<0.253.0> with url found 1 certificates [6979 -> 6980].
19:12:56.189 [info] Worker #PID<0.305.0> with url found 10 certificates [67296 -> 67306].
19:13:09.950 [info] Worker #PID<0.310.0> with url found 1 certificates [3461 -> 3462].
20:12:59.092 [info] Worker #PID<0.310.0> with url found 1 certificates [3462 -> 3463].
20:13:02.009 [info] Worker #PID<0.293.0> with url found 1 certificates [3704 -> 3705].
20:13:05.944 [info] Worker #PID<0.305.0> with url found 10 certificates [67306 -> 67316].
20:54:05.769 [info] Worker #PID<0.253.0> with url found 2 certificates [6980 -> 6982].
21:13:00.290 [info] Worker #PID<0.305.0> with url found 11 certificates [67316 -> 67327].
22:13:02.392 [info] Worker #PID<0.310.0> with url found 1 certificates [3463 -> 3464].
22:13:03.354 [info] Worker #PID<0.293.0> with url found 1 certificates [3705 -> 3706].
22:13:09.709 [info] Worker #PID<0.305.0> with url found 11 certificates [67327 -> 67338].
22:54:01.630 [info] Worker #PID<0.253.0> with url found 1 certificates [6982 -> 6983].
23:12:56.454 [info] Worker #PID<0.293.0> with url found 1 certificates [3706 -> 3707].
23:12:56.563 [info] Worker #PID<0.310.0> with url found 1 certificates [3464 -> 3465].
23:13:04.060 [info] Worker #PID<0.305.0> with url found 10 certificates [67338 -> 67348].
00:12:58.315 [info] Worker #PID<0.305.0> with url found 11 certificates [67348 -> 67359].
00:13:04.800 [info] Worker #PID<0.293.0> with url found 1 certificates [3707 -> 3708].`

It seems that most of the server workers that extract domains from the separate sources are dead, and only 4 workers are still functional (the log above shows just four distinct PIDs).
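One way to check how many workers are still reporting is to count the distinct PIDs in the log. The following is a minimal sketch, assuming log lines shaped like the excerpt above; the `summarize_workers` helper and the regular expression are illustrative and not part of certstream-server.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, e.g.
# "17:13:07.499 [info] Worker #PID<0.305.0> with url found 11 certificates [67275 -> 67286]."
# The pattern is an assumption based on that excerpt, not an official log format.
LINE_RE = re.compile(
    r"\[info\] Worker #PID<(?P<pid>[\d.]+)> .*?found (?P<count>\d+) certificates? "
    r"\[(?P<start>\d+) -> (?P<end>\d+)\]"
)


def summarize_workers(log_text):
    """Return {pid: total certificates found} for every worker seen in log_text."""
    totals = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            totals[match.group("pid")] += int(match.group("count"))
    return dict(totals)


SAMPLE = """\
17:13:06.601 [info] Worker #PID<0.310.0> with url found 1 certificates [3460 -> 3461].
17:13:07.499 [info] Worker #PID<0.305.0> with url found 11 certificates [67275 -> 67286].
17:13:07.736 [info] Worker #PID<0.293.0> with url found 1 certificates [3703 -> 3704].
18:13:01.839 [info] Worker #PID<0.305.0> with url found 10 certificates [67286 -> 67296].
18:54:09.569 [info] Worker #PID<0.253.0> with url found 1 certificates [6978 -> 6979].
"""

if __name__ == "__main__":
    totals = summarize_workers(SAMPLE)
    print(f"{len(totals)} active workers: {totals}")
```

Run against a full day of logs, a shrinking number of distinct PIDs would point to workers dying without being restarted.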

This bug reproduces both on a remote machine with an old CentOS distribution and on my work machine running Ubuntu 18.04. Erlang version on both machines:

Same here. Even just consuming the "official" CaliDog websocket seems to show this issue (a very weird event distribution, like 1 cert per minute one day and then consistently 100 certs per second the next).

Hi there, it turns out that Heroku's daily dyno restart was actually masking an issue with the service: the supervisor tree was never fully initialized, and therefore wasn't prepared to properly restart things when errors occurred (leading to a slow but difficult-to-diagnose degradation in service).

I have since fixed this, both in master and at, please let me know if you experience further issues, and sorry for the breakage.
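To illustrate why missing supervision degrades slowly rather than failing outright, here is a toy simulation. It is a sketch in Python rather than Elixir/OTP and is not the actual fix; the `Worker` class, the `run` function, and the scripted crash ticks are all hypothetical.

```python
class Worker:
    """Hypothetical stand-in for one per-source worker process."""

    def __init__(self, name, crash_on_ticks=()):
        self.name = name
        self.crash_on_ticks = set(crash_on_ticks)
        self.alive = True

    def poll(self, tick):
        # Simulate an unhandled error on the scripted ticks.
        if tick in self.crash_on_ticks:
            raise RuntimeError(f"{self.name} crashed at tick {tick}")


def run(workers, ticks, supervised):
    """Return the number of successful polls across all workers."""
    successful_polls = 0
    for tick in range(ticks):
        for worker in workers:
            if not worker.alive:
                if not supervised:
                    continue  # nobody restarts it: the worker stays dead
                worker.alive = True  # a supervisor restarts the worker
            try:
                worker.poll(tick)
                successful_polls += 1
            except RuntimeError:
                worker.alive = False
    return successful_polls


def make_workers():
    # Fresh workers for each run, because run() mutates their state.
    return [
        Worker("a", crash_on_ticks=[1]),
        Worker("b", crash_on_ticks=[2]),
        Worker("c"),
    ]


if __name__ == "__main__":
    print("unsupervised:", run(make_workers(), ticks=5, supervised=False))
    print("supervised:  ", run(make_workers(), ticks=5, supervised=True))
```

With these scripted crashes the unsupervised run completes 8 polls and the supervised run 13. Each crash permanently removes a worker in the first case, which matches the gradual drop in throughput described above.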

@Fitblip I'm not sure if it's related to this issue, but I haven't seen any certificates come through the official websocket for about the last 90 minutes. I've confirmed it's not my iffy code by running the official CLI tool and checking the website - both have the same behaviour.

Is it a dyno thing again?

Howdy @aidansteele - this is actually an old GitHub issue that I forgot to close, and it's unrelated to the pipeline being down.

We have an issue with our provider currently that we're working through (there's been a good deal of flapping in the past few days, so the pipeline actually going down was ignored due to the false alerts - sorry for the downtime).

No problem at all! It’s a really amazing service, thank you for producing it and hosting it 👍