hbz/lobid-resources

Play app crashes a few minutes after starting ETL

Closed this issue · 5 comments

dr0i commented

2023-10-20 02:39:41 GMT+02:00 [INFO] from application in application-akka.actor.default-dispatcher-480 - Called from: '10.9.2.41'
Starting ETL of 'update'..

[...] Nothing in the log indicates the cause of the crash. The app is simply restarted about 9 minutes after the ETL was triggered:

2023-10-20 02:48:10 GMT+02:00 [INFO] from play.api.Play in main - Application started (Prod)
2023-10-20 02:48:10 GMT+02:00 [INFO] from play.core.server.NettyServer in main - Listening for HTTP on /0.0.0.0:7507

May be related to #1264.

dr0i commented

Maybe it would be a good idea to separate ETL from lobid API. E.g. we could use the fallback (q3) to do the ETL part.

dr0i commented

Uh - that's the cause of the restart (from monit.log):

[2023-10-20T02:45:23+0200] warning : 'alma-lobid-localhost' failed protocol test [HTTP] at [localhost]:7507/resources/search?q= [TCP/IP] -- HTTP: Error receiving data -- Resource temporarily unavailable
[2023-10-20T02:46:50+0200] error : 'lobid-resources-alma' failed protocol test [DEFAULT] at [127.0.0.1]:7507 [TCP/IP] -- Connection timed out
[2023-10-20T02:46:50+0200] info : 'lobid-resources-alma' trying to restart
[2023-10-20T02:46:50+0200] info : 'lobid-resources-alma' stop: '/home/sol/git/lobid-resources-alma/web/monit_restart.sh lobid-resources-alma/web stop 7507'
[2023-10-20T02:47:06+0200] info : 'lobid-resources-alma' start: '/home/sol/git/lobid-resources-alma/web/monit_restart.sh lobid-resources-alma/web start 7507 -Xmx20G,-Xms6G'

monit checks whether the lobid API is OK (a query against localhost:7507). If that check runs into a timeout, it is repeated until monit eventually restarts the lobid API to make it responsive again.
The unresponsiveness is most likely rooted in the heavily threaded ETL processes, which take up too many resources for the Play app to stay responsive.
We could:
a) separate ETL from normal web API (as proposed before)
b) use fewer CPU cores (currently 10 of 12)
c) stop checking whether the lobid app is up
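
Option b) could be sketched roughly as follows. This is only an illustration, not the actual lobid-resources ETL code: the class `EtlThreadBudget`, the method `etlThreads` and the number of reserved cores are hypothetical. The idea is to size the ETL worker pool so that some cores always stay free for the web API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: cap the ETL worker threads so the Play app keeps some cores.
public class EtlThreadBudget {

    // Returns the number of ETL threads: all cores minus the reserved ones, but at least 1.
    public static int etlThreads(int availableCores, int reservedCores) {
        return Math.max(1, availableCores - reservedCores);
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        int reserved = 6; // assumption: leave half of a 12-core machine to the web API
        int threads = etlThreads(cores, reserved);

        // Fixed-size pool: the ETL never runs more than `threads` tasks in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            final int worker = i;
            // Placeholder for a real ETL task
            pool.submit(() -> System.out.println("ETL worker " + worker + " running"));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

With 12 cores, reserving 2 gives the current 10 ETL threads, while reserving 6 would leave half of the machine to the API. Whether that is enough to keep the monit check from timing out would have to be tested.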

dr0i commented

We go with a). @blackwinter configures the ETL webhook caller.

Webhook URL has been changed from https://lobid.org/... to https://stage.lobid.org/....

Works so far.
Had forgotten to set switch.automatically to true. Did that now and also switched the index alias manually.
Closing.