BlockchainCommons/spotbit

Spotbit Downtime / Logs Limited

shannona opened this issue · 3 comments

We had some type of downtime on Spotbit on Sunday:

daemon.log:

Dec  7 11:05:05 localhost systemd[1]: spotbit.service: Main process exited, code=killed, status=9/KILL
Dec  7 11:05:05 localhost systemd[1]: spotbit.service: Failed with result 'signal'.

SIGKILL indicates to me that was user generated, but there was no one logged in at the time, so I assume some process choose to kill spotbit for some reason.

The user.log shows active usage up until shortly before that:

user.log:

Dec  7 11:04:26 localhost python3[431]: 127.0.0.1 - - [07/Dec/2020 11:04:26] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec  7 11:04:26 localhost python3[431]: INFO:werkzeug:127.0.0.1 - - [07/Dec/2020 11:04:26] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec  7 11:04:27 localhost python3[431]: INFO:spotbit:[2020-12-07 11:04:27.185816] | inserted into okcoin USDT 40000 times
Dec  7 11:04:51 localhost python3[431]: 127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec  7 11:04:51 localhost python3[431]: INFO:werkzeug:127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec  7 11:04:51 localhost python3[431]: 127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -
Dec  7 11:04:51 localhost python3[431]: INFO:werkzeug:127.0.0.1 - - [07/Dec/2020 11:04:51] "#033[37mGET /now/usd/kraken HTTP/1.1#033[0m" 200 -

There's a confusing mess of spotbit.err and spotbit.log logs in various places. The only error file, /root/spotbit.err, is three days old (e.g., a day before the crash) and the most up-to-date log file, /home/spotbit/.spotbit/spotbit.log, shows very little between 10-06 and 10-09. (It's also hard to read because most lines don't have a date.)

The system was in a very bad state when I found it. spotbit was not running. I restarted it, then I restarted it and tor, and still nothing was being passed to spotbit. I rebooted the system and this time tor didn't start right, and I rebooted the system a second time before everything finally came back into sync. I have to assume that's all due to something not shutting down cleanly, most likely the database, which was throwing locked messages after the first reboot.

So, I know this isn't much info, here's a list of suggested things that might help:

  1. Make sure spotbit is logging an error file
  2. See if we can improve spotbit.log to at least log date-time on all lines
  3. Investigate if anything in tor or spotbit might issue a SIGKILL
  4. See if there is any way to better restart spotbit cleanly after a messy SIGKILL
  5. Monitor our services, including spotbit (I'm going to probably take this one on with Christopher next week)

@javiervargas found the answer in the system logs: an out of memory error. That answers point #3 above, which was the reason for this particular crash. spotbit is on a tiny 1GB-memory machine, and it might be that's not sufficient memory for large lookups. I've tried to expand things by pushing the swap from 512MB to the recommend 1GB. If that's insufficient, the machine probably needs to be pushed to a larger size.

I leave the following possible issues:

  1. Make sure spotbit is logging an error file
  2. See if we can improve spotbit.log to at least log date-time on all lines
  3. Investigate if anything in tor or spotbit might issue a SIGKILL
  4. See if there is any way to better restart spotbit cleanly after a messy SIGKILL
  5. Monitor our services, including spotbit (I'm going to probably take this one on with Christopher next week)

This is the log:

Dec  7 11:05:05 localhost kernel: [1696838.638615] [  29318]    33 29318    53352        0   147456     2009             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.640057] [  29319]    33 29319    53352        0   147456     1999             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.641506] [  29320]    33 29320    53352        0   147456     2001             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.642947] [  29321]    33 29321    53354        0   147456     2003             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.644385] [  29322]    33 29322    53354        0   147456     2017             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.645828] [  29359]    33 29359    53354        1   147456     2001             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.647278] [  29360]    33 29360    53352        0   147456     2001             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.648712] [   4779]    33  4779    53287        1   147456     1964             0 apache2
Dec  7 11:05:05 localhost kernel: [1696838.650160] Out of memory: Kill process 431 (python3) score 862 or sacrifice child
Dec  7 11:05:05 localhost kernel: [1696838.651265] Killed process 431 (python3) total-vm:2000224kB, anon-rss:875860kB, file-rss:0kB, shmem-rss:0kB
Dec  7 11:05:05 localhost kernel: [1696838.761387] oom_reaper: reaped process 431 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec  7 11:05:05 localhost systemd[1]: spotbit.service: Main process exited, code=killed, status=9/KILL
Dec  7 11:05:05 localhost systemd[1]: spotbit.service: Failed with result 'signal'.
Dec  7 11:05:25 localhost kernel: [1696859.041998] [UFW BLOCK] IN=eth0 OUT= MAC=f2:3c:92:47:2d:91:84:78:ac:57:a8:41:08:00 SRC=181.30.28.51 DST=104.200.29.8 LEN=100 TOS=0x00 PREC=0x00 TTL=52 ID=48552 DF PROTO=UDP SPT=4000 DPT=11211 LEN=80
Dec  7 11:05:44 localhost kernel: [1696878.216623] [UFW BLOCK] IN=eth0 OUT= MAC=f2:3c:92:47:2d:91:84:78:ac:57:aa:c1:08:00 SRC=45.129.33.10 DST=104.200.29.8 LEN=40 TOS=0x00 PREC=0x00 TTL=247 ID=18836 PROTO=TCP SPT=49143 DPT=28182 WINDOW=1024 RES=0x00 SYN URGP=0

This is fixed in the latest refactor in master to improve stability and make Spotbit more lightweight. If Spotbit crashes now, we should be able to pinpoint specific issues easily.