bird-house/twitcher

Twitcher locks up with waitress.queue problem

tlvu opened this issue · 3 comments

tlvu commented

Twitcher version pavics/twitcher:magpie-3.12.0 (from PAVICS stack https://github.com/bird-house/birdhouse-deploy/tree/1.13.6).

When trying to access https://pavics.ouranos.ca/thredds/ it was completely unresponsive. That link goes through the Nginx proxy, then Twitcher, then Thredds itself.

Nginx was ruled out by hitting https://pavics.ouranos.ca/jupyter, which works. That link goes through the Nginx proxy then Jupyter, with no Twitcher involved.

Thredds was ruled out by hitting it directly at http://pavics.ouranos.ca:8083/twitcher/ows/proxy/thredds/catalog.html (reachable from the internal network only).

So the only components left were Twitcher and its extension Magpie, so I ran ./pavics-compose.sh restart twitcher magpie and the problem was gone.

Not sure how to reproduce this problem. @tlogan2000 reported it to me and I had to quickly diagnose and then restore the production deployment to a working state.

@fmigneault @dbyrns have you ever experienced this before? It's the first time for us. Do the logs below ring any bells?

Here are the Twitcher logs (docker logs twitcher); timestamps are in UTC:

2021-06-08 17:42:39,577 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:39,583 INFO [TWITCHER:132][waitress-3] 'None' request 'browse' permission on '/ows/proxy/thredds/ncml/birdhouse/cccs_portal/indices/Final/ECCC_CRD/SPEI/Observed/3-Month/CCRC_CANGRD_MON_1900_2014_spei_bc_ref_period_195001_200512_timev_3-Month.nc'
2021-06-08 17:42:39,620 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:44,920 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:44,927 INFO [TWITCHER:132][waitress-3] 'None' request 'browse' permission on '/ows/proxy/thredds/catalog/birdhouse/cccs_portal/indices/BCCAQv2/txgt_30/CNRM-CM5/catalog.html'
2021-06-08 17:42:44,963 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:48,276 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:48,282 INFO [TWITCHER:132][waitress-3] 'None' request 'browse' permission on '/ows/proxy/thredds/ncml/birdhouse/cmip5/NASA-GISS/GISS-E2-H/historical/mon/atmos/r3i1p1/tasmin/tasmin_Amon_GISS-E2-H_historical_r3i1p1_190101-195012.nc'
2021-06-08 17:42:48,319 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:49,458 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:49,465 INFO [TWITCHER:132][waitress-3] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/climex/day_climex-crcm5_historical+rcp85.ncml.dods'
2021-06-08 17:42:49,488 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-06-08 17:42:54,282 WARNI [waitress.queue:117][MainThread] Task queue depth is 1
2021-06-08 17:43:01,509 WARNI [waitress.queue:117][MainThread] Task queue depth is 2
2021-06-08 17:43:07,961 WARNI [waitress.queue:117][MainThread] Task queue depth is 3
2021-06-08 17:43:38,678 WARNI [waitress.queue:117][MainThread] Task queue depth is 4
2021-06-08 17:43:44,135 WARNI [waitress.queue:117][MainThread] Task queue depth is 5
2021-06-08 17:43:58,365 WARNI [waitress.queue:117][MainThread] Task queue depth is 6
2021-06-08 17:44:51,628 WARNI [waitress.queue:117][MainThread] Task queue depth is 7
2021-06-08 17:45:41,402 WARNI [waitress.queue:117][MainThread] Task queue depth is 8
2021-06-08 17:46:36,007 WARNI [waitress.queue:117][MainThread] Task queue depth is 9
2021-06-08 17:46:58,702 WARNI [waitress.queue:117][MainThread] Task queue depth is 10
2021-06-08 17:47:26,503 WARNI [waitress.queue:117][MainThread] Task queue depth is 11
2021-06-08 17:47:36,827 WARNI [waitress.queue:117][MainThread] Task queue depth is 12
2021-06-08 17:48:24,873 WARNI [waitress.queue:117][MainThread] Task queue depth is 13
2021-06-08 17:48:54,593 WARNI [waitress.queue:117][MainThread] Task queue depth is 14
2021-06-08 17:49:00,128 WARNI [waitress.queue:117][MainThread] Task queue depth is 15
2021-06-08 17:49:01,716 WARNI [waitress.queue:117][MainThread] Task queue depth is 16
2021-06-08 17:49:14,948 WARNI [waitress.queue:117][MainThread] Task queue depth is 17
2021-06-08 17:49:44,461 WARNI [waitress.queue:117][MainThread] Task queue depth is 18
2021-06-08 17:50:21,796 WARNI [waitress.queue:117][MainThread] Task queue depth is 19
2021-06-08 17:51:11,874 WARNI [waitress.queue:117][MainThread] Task queue depth is 20
2021-06-08 17:51:58,776 WARNI [waitress.queue:117][MainThread] Task queue depth is 21
2021-06-08 17:52:34,107 WARNI [waitress.queue:117][MainThread] Task queue depth is 22
2021-06-08 17:53:24,177 WARNI [waitress.queue:117][MainThread] Task queue depth is 23
2021-06-08 17:54:55,128 WARNI [waitress.queue:117][MainThread] Task queue depth is 24
2021-06-08 17:55:01,815 WARNI [waitress.queue:117][MainThread] Task queue depth is 25
2021-06-08 17:55:09,748 WARNI [waitress.queue:117][MainThread] Task queue depth is 26
2021-06-08 17:55:18,775 WARNI [waitress.queue:117][MainThread] Task queue depth is 27
2021-06-08 17:55:47,210 WARNI [waitress.queue:117][MainThread] Task queue depth is 28
2021-06-08 17:56:09,282 WARNI [waitress.queue:117][MainThread] Task queue depth is 29
2021-06-08 17:56:58,328 WARNI [waitress.queue:117][MainThread] Task queue depth is 30
2021-06-08 17:59:07,503 WARNI [waitress.queue:117][MainThread] Task queue depth is 31
2021-06-08 17:59:57,601 WARNI [waitress.queue:117][MainThread] Task queue depth is 32
2021-06-08 18:00:55,708 WARNI [waitress.queue:117][MainThread] Task queue depth is 33
2021-06-08 18:01:02,055 WARNI [waitress.queue:117][MainThread] Task queue depth is 34
2021-06-08 18:01:35,500 WARNI [waitress.queue:117][MainThread] Task queue depth is 35
2021-06-08 18:01:47,685 WARNI [waitress.queue:117][MainThread] Task queue depth is 36
2021-06-08 18:01:59,642 WARNI [waitress.queue:117][MainThread] Task queue depth is 37
2021-06-08 18:05:03,903 WARNI [waitress.queue:117][MainThread] Task queue depth is 38
2021-06-08 18:05:53,981 WARNI [waitress.queue:117][MainThread] Task queue depth is 39
2021-06-08 18:05:54,102 WARNI [waitress.queue:117][MainThread] Task queue depth is 40
2021-06-08 18:06:40,456 WARNI [waitress.queue:117][MainThread] Task queue depth is 41
2021-06-08 18:06:56,231 WARNI [waitress.queue:117][MainThread] Task queue depth is 42
2021-06-08 18:06:58,610 WARNI [waitress.queue:117][MainThread] Task queue depth is 43
2021-06-08 18:07:02,086 WARNI [waitress.queue:117][MainThread] Task queue depth is 44
2021-06-08 18:07:09,592 WARNI [waitress.queue:117][MainThread] Task queue depth is 45
2021-06-08 18:07:14,482 WARNI [waitress.queue:117][MainThread] Task queue depth is 46
2021-06-08 18:07:48,849 WARNI [waitress.queue:117][MainThread] Task queue depth is 47
2021-06-08 18:09:08,867 WARNI [waitress.queue:117][MainThread] Task queue depth is 48
2021-06-08 18:11:30,594 WARNI [waitress.queue:117][MainThread] Task queue depth is 49
2021-06-08 18:11:38,770 WARNI [waitress.queue:117][MainThread] Task queue depth is 50
2021-06-08 18:11:58,703 WARNI [waitress.queue:117][MainThread] Task queue depth is 51
2021-06-08 18:13:02,219 WARNI [waitress.queue:117][MainThread] Task queue depth is 52
2021-06-08 18:13:03,407 WARNI [waitress.queue:117][MainThread] Task queue depth is 53
2021-06-08 18:13:13,963 WARNI [waitress.queue:117][MainThread] Task queue depth is 54
2021-06-08 18:13:26,849 WARNI [waitress.queue:117][MainThread] Task queue depth is 55
2021-06-08 18:14:07,403 WARNI [waitress.queue:117][MainThread] Task queue depth is 56
2021-06-08 18:14:29,559 WARNI [waitress.queue:117][MainThread] Task queue depth is 57
2021-06-08 18:15:16,154 WARNI [waitress.queue:117][MainThread] Task queue depth is 58
2021-06-08 18:16:06,288 WARNI [waitress.queue:117][MainThread] Task queue depth is 59
2021-06-08 18:16:58,314 WARNI [waitress.queue:117][MainThread] Task queue depth is 60
2021-06-08 18:17:24,939 WARNI [waitress.queue:117][MainThread] Task queue depth is 61
2021-06-08 18:17:36,646 WARNI [waitress.queue:117][MainThread] Task queue depth is 62
2021-06-08 18:17:50,353 WARNI [waitress.queue:117][MainThread] Task queue depth is 63
2021-06-08 18:18:47,842 WARNI [waitress.queue:117][MainThread] Task queue depth is 64
2021-06-08 18:19:02,246 WARNI [waitress.queue:117][MainThread] Task queue depth is 65
2021-06-08 18:19:15,329 WARNI [waitress.queue:117][MainThread] Task queue depth is 66
2021-06-08 18:19:33,968 WARNI [waitress.queue:117][MainThread] Task queue depth is 67
2021-06-08 18:20:01,236 WARNI [waitress.queue:117][MainThread] Task queue depth is 68
2021-06-08 18:20:08,660 WARNI [waitress.queue:117][MainThread] Task queue depth is 69
2021-06-08 18:21:58,346 WARNI [waitress.queue:117][MainThread] Task queue depth is 70

Magpie logs for the same time range (docker logs magpie):

[2021-06-08 17:41:02,285] INFO       [ThreadPoolExecutor-0_2][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/version]
[2021-06-08 17:41:02,287] INFO       [ThreadPoolExecutor-0_2][sqlalchemy.engine.base.Engine] BEGIN (implicit)
[2021-06-08 17:41:02,288] INFO       [ThreadPoolExecutor-0_2][sqlalchemy.engine.base.Engine] SELECT version_num FROM alembic_version
[2021-06-08 17:41:02,288] INFO       [ThreadPoolExecutor-0_2][sqlalchemy.engine.base.Engine] {}
[2021-06-08 17:41:02,289] INFO       [ThreadPoolExecutor-0_2][sqlalchemy.engine.base.Engine] ROLLBACK
[2021-06-08 17:42:02,227] INFO       [ThreadPoolExecutor-0_3][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/version]
[2021-06-08 17:42:02,229] INFO       [ThreadPoolExecutor-0_3][sqlalchemy.engine.base.Engine] BEGIN (implicit)
[2021-06-08 17:42:02,229] INFO       [ThreadPoolExecutor-0_3][sqlalchemy.engine.base.Engine] SELECT version_num FROM alembic_version
[2021-06-08 17:42:02,229] INFO       [ThreadPoolExecutor-0_3][sqlalchemy.engine.base.Engine] {}
[2021-06-08 17:42:02,231] INFO       [ThreadPoolExecutor-0_3][sqlalchemy.engine.base.Engine] ROLLBACK
[2021-06-08 17:49:01,685] INFO       [ThreadPoolExecutor-0_0][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/version]
[2021-06-08 17:49:01,687] INFO       [ThreadPoolExecutor-0_0][sqlalchemy.engine.base.Engine] BEGIN (implicit)
[2021-06-08 17:49:01,687] INFO       [ThreadPoolExecutor-0_0][sqlalchemy.engine.base.Engine] SELECT version_num FROM alembic_version
[2021-06-08 17:49:01,687] INFO       [ThreadPoolExecutor-0_0][sqlalchemy.engine.base.Engine] {}
[2021-06-08 17:49:01,689] INFO       [ThreadPoolExecutor-0_0][sqlalchemy.engine.base.Engine] ROLLBACK
[2021-06-08 18:18:34,503] INFO       [ThreadPoolExecutor-0_3][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/]
[2021-06-08 18:18:34,944] INFO       [ThreadPoolExecutor-0_2][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/themes/blue.css]
[2021-06-08 18:18:35,021] INFO       [ThreadPoolExecutor-0_2][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/style.css]
[2021-06-08 18:18:35,126] INFO       [ThreadPoolExecutor-0_1][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/users.png]
[2021-06-08 18:18:35,126] INFO       [ThreadPoolExecutor-0_1][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/magpie.png]
[2021-06-08 18:18:35,210] INFO       [ThreadPoolExecutor-0_0][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/settings_white.png]
[2021-06-08 18:18:35,229] INFO       [ThreadPoolExecutor-0_3][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/groups.png]
[2021-06-08 18:18:35,230] INFO       [ThreadPoolExecutor-0_3][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/services.png]
[2021-06-08 18:18:35,230] INFO       [ThreadPoolExecutor-0_1][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/login.png]
[2021-06-08 18:18:35,433] INFO       [ThreadPoolExecutor-0_2][magpie.utils] Request: [GET pavics.ouranos.ca /magpie/static/settings.png]

This has popped up again this morning. I would say if this keeps up as a regular problem then it is pretty high priority, as it essentially makes notebook examples / data access unusable. JupyterLab users could most likely use the port 8083 trick, but most would never know to do that.

tlvu commented

More info, in case it helps: in both cases when Twitcher became unresponsive with these waitress.queue warnings, Thredds was under load and probably responding slowly.

I have not seen that error specifically, but I remember having issues with waitress when investigating the performance problem with large file downloads. Maybe we can try switching to gunicorn and see if the issue comes up again?

Just replace:

use = egg:waitress#main
listen = 0.0.0.0:8000

with:

use = egg:gunicorn#main
host = 0.0.0.0
port = 8000
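
For reference, a fuller server section might look like the sketch below. The worker and thread counts are illustrative assumptions, not values from the actual deployment; gunicorn's usual rule of thumb for workers is (2 x CPU cores) + 1, and the exact section name and file depend on the twitcher .ini in use:

```ini
[server:main]
use = egg:gunicorn#main
host = 0.0.0.0
port = 8000
# illustrative: number of worker processes handling requests in parallel
workers = 5
# illustrative: per-worker threads can help when requests block on a slow backend
threads = 4
# illustrative: seconds before an unresponsive worker is killed and restarted
timeout = 60
```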

Adding more workers could also potentially help.
Task queue depth seems to be related to the number of pending requests vs available workers.
If large files are being downloaded via Twitcher, it could be that all workers are busy and cannot respond to new requests.
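
That dynamic can be illustrated with a toy queue simulation (this is not waitress's actual code, and the numbers are made up): with a fixed worker pool, as soon as requests arrive faster than workers can finish them, the backlog grows without bound, which matches the steadily climbing "Task queue depth" warnings above.

```python
import heapq

def max_queue_depth(workers, service_time, arrival_interval, horizon):
    """Toy discrete-event sketch: requests arrive every `arrival_interval`
    seconds, each occupies a worker for `service_time` seconds, and the
    function returns the maximum backlog seen over `horizon` seconds."""
    busy = []        # min-heap of times at which busy workers free up
    depth = 0        # requests waiting for a worker (the "task queue")
    max_depth = 0
    t = 0.0
    while t < horizon:
        # release workers whose request has completed by time t
        while busy and busy[0] <= t:
            heapq.heappop(busy)
            if depth > 0:                    # a queued request grabs the freed worker
                depth -= 1
                heapq.heappush(busy, t + service_time)
        # a new request arrives
        if len(busy) < workers:
            heapq.heappush(busy, t + service_time)
        else:
            depth += 1                       # all workers busy: request queues up
            max_depth = max(max_depth, depth)
        t += arrival_interval
    return max_depth

# Slow backend (e.g. a loaded Thredds): 4 workers, 10 s per request,
# a request every 2 s -- the queue depth climbs steadily.
print(max_queue_depth(workers=4, service_time=10.0, arrival_interval=2.0, horizon=120.0))

# Healthy backend: 1 s per request -- the queue never builds up.
print(max_queue_depth(workers=4, service_time=1.0, arrival_interval=2.0, horizon=120.0))
```

This is consistent with Thredds being under load: the slower the backend responds, the longer each proxied request holds a worker, so the incoming rate only has to exceed workers / service_time for the depth counter to keep rising.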