Spotbit Downtime / Logs Limited
shannona opened this issue · 3 comments
We had some type of downtime on Spotbit on Sunday:
daemon.log:
Dec 7 11:05:05 localhost systemd[1]: spotbit.service: Main process exited, code=killed, status=9/KILL
Dec 7 11:05:05 localhost systemd[1]: spotbit.service: Failed with result 'signal'.
SIGKILL
indicates to me that was user generated, but there was no one logged in at the time, so I assume some process choose to kill spotbit
for some reason.
The user.log
shows active usage up until shortly before that:
user.log:
Dec 7 11:04:26 localhost python3[431]: 127.0.0.1 - - [07/Dec/2020 11:04:26] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec 7 11:04:26 localhost python3[431]: INFO:werkzeug:127.0.0.1 - - [07/Dec/2020 11:04:26] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec 7 11:04:27 localhost python3[431]: INFO:spotbit:[2020-12-07 11:04:27.185816] | inserted into okcoin USDT 40000 times
Dec 7 11:04:51 localhost python3[431]: 127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec 7 11:04:51 localhost python3[431]: INFO:werkzeug:127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec 7 11:04:51 localhost python3[431]: 127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec 7 11:04:51 localhost python3[431]: INFO:werkzeug:127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
There's a confusing mess of spotbit.err
and spotbit.log
logs in various places. The only error file, /root/spotbit.err
, is three days old (e.g., a day before the crash) and the most up-to-date log file, /home/spotbit/.spotbit/spotbit.log
, shows very little between 10-06 and 10-09. (It's also hard to read because most lines don't have a date.)
The system was in a very bad state when I found it. spotbit
was not running. I restarted it, then I restarted it and tor
, and still nothing was being passed to spotbit
. I rebooted the system and this time tor
didn't start right, and I rebooted the system a second time before everything finally came back into sync. I have to assume that's all due to something not shutting down cleanly, most likely the database, which was throwing locked
messages after the first reboot.
So, I know this isn't much info, here's a list of suggested things that might help:
- Make sure
spotbit
is logging an error file - See if we can improve
spotbit.log
to at least log date-time on all lines - Investigate if anything in
tor
orspotbit
might issue aSIGKILL
- See if there is any way to better restart
spotbit
cleanly after a messySIGKILL
- Monitor our services, including spotbit (I'm going to probably take this one on with Christopher next week)
@javiervargas found the answer in the system logs: an out of memory error. That answers point #3 above, which was the reason for this particular crash. spotbit is on a tiny 1GB-memory machine, and it might be that's not sufficient memory for large lookups. I've tried to expand things by pushing the swap from 512MB to the recommend 1GB. If that's insufficient, the machine probably needs to be pushed to a larger size.
I leave the following possible issues:
- Make sure spotbit is logging an error file
- See if we can improve spotbit.log to at least log date-time on all lines
Investigate if anything in tor or spotbit might issue a SIGKILL- See if there is any way to better restart spotbit cleanly after a messy SIGKILL
- Monitor our services, including spotbit (I'm going to probably take this one on with Christopher next week)
This is the log:
Dec 7 11:05:05 localhost kernel: [1696838.638615] [ 29318] 33 29318 53352 0 147456 2009 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.640057] [ 29319] 33 29319 53352 0 147456 1999 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.641506] [ 29320] 33 29320 53352 0 147456 2001 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.642947] [ 29321] 33 29321 53354 0 147456 2003 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.644385] [ 29322] 33 29322 53354 0 147456 2017 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.645828] [ 29359] 33 29359 53354 1 147456 2001 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.647278] [ 29360] 33 29360 53352 0 147456 2001 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.648712] [ 4779] 33 4779 53287 1 147456 1964 0 apache2
Dec 7 11:05:05 localhost kernel: [1696838.650160] Out of memory: Kill process 431 (python3) score 862 or sacrifice child
Dec 7 11:05:05 localhost kernel: [1696838.651265] Killed process 431 (python3) total-vm:2000224kB, anon-rss:875860kB, file-rss:0kB, shmem-rss:0kB
Dec 7 11:05:05 localhost kernel: [1696838.761387] oom_reaper: reaped process 431 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 7 11:05:05 localhost systemd[1]: spotbit.service: Main process exited, code=killed, status=9/KILL
Dec 7 11:05:05 localhost systemd[1]: spotbit.service: Failed with result 'signal'.
Dec 7 11:05:25 localhost kernel: [1696859.041998] [UFW BLOCK] IN=eth0 OUT= MAC=f2:3c:92:47:2d:91:84:78:ac:57:a8:41:08:00 SRC=181.30.28.51 DST=104.200.29.8 LEN=100 TOS=0x00 PREC=0x00 TTL=52 ID=48552 DF PROTO=UDP SPT=4000 DPT=11211 LEN=80
Dec 7 11:05:44 localhost kernel: [1696878.216623] [UFW BLOCK] IN=eth0 OUT= MAC=f2:3c:92:47:2d:91:84:78:ac:57:aa:c1:08:00 SRC=45.129.33.10 DST=104.200.29.8 LEN=40 TOS=0x00 PREC=0x00 TTL=247 ID=18836 PROTO=TCP SPT=49143 DPT=28182 WINDOW=1024 RES=0x00 SYN URGP=0
This is fixed in the latest refactor in master to improve stability and make Spotbit more lightweight. If Spotbit crashes now, we should be able to pinpoint specific issues easily.