BlockchainCommons/spotbit

Tor Slow / Unresponsive on Server

watersnake1 opened this issue · 9 comments

Spotbit's tor server often becomes slow or completely unresponsive after a few hours to days. This issue is faced by anyone trying to access the server, regardless of location.

Issue occurs for all endpoints, including particular exchanges (eg /now/USD/kraken), which require minimal computation, so I believe it is unlikely it is an optimization issue. Additionally, these issues are not seen on a clearnet connection and have appeared recently. I have also noticed tor issues in the last month on my own personal tor devices, which have also required restarts to be reachable.

At the very least, a script to automatically kill and restart tor periodically may help.

I did some more research on this today.

I can pretty reliably get

http://h6zwwkcivy2hjys6xpinlnz2f74dsmvltzsd4xb42vinhlcaoe7fdeqd.onion/now/USD/coinbasepro

back in 2 seconds

But http://h6zwwkcivy2hjys6xpinlnz2f74dsmvltzsd4xb42vinhlcaoe7fdeqd.onion/now/USD

Takes 30 seconds

So that's a good start for comparing.

I straced the spotbit server and it seems to immediately receive the command and immediately respond to it: within those couples of seconds, no matter which of the queries it was.

Examining what the Tor server is doing is a little more difficult, because it's busier.

This is a quick response ("GET /now/USD/coinbasepro"):

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 14
setsockopt(14, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
connect(14, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLOUT, {u32=14, u64=14}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=14, u64=14}}], 32, 108) = 1
getsockopt(14, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(14, {sa_family=AF_INET, sin_port=htons(54486), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 0
getsockopt(13, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\1@\204\4\0@\234\0\0\250\5\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
ioctl(13, SIOCOUTQNSD, 0x55da6c88bb54)  = 0
write(13, "\27\3\3\2\23C\273,m\2\24O\322\205-\320+\346\217g\345\5\314\33\353M\235\220\301!\257\376"..., 536) = 536
epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
epoll_wait(3, [], 32, 104)              = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [{EPOLLIN, {u32=13, u64=13}}], 32, 1000) = 1
read(13, "\27\3\3\2\23", 5)             = 5
read(13, "8I\260\332\4\361\356\305\213\251|\355^\245\356W\3\353\v\20\2455\203\242$\0172\246\31\242%\26"..., 531) = 531
epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN|EPOLLOUT, {u32=14, u64=14}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=14, u64=14}}], 32, 172) = 1
sendto(14, "GET /now/USD/coinbasepro HTTP/1."..., 390, 0, NULL, 0) = 390
epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
epoll_wait(3, [], 32, 172)              = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [{EPOLLIN, {u32=14, u64=14}}], 32, 1000) = 1
recvfrom(14, "HTTP/1.0 200 OK\r\n", 16384, 0, NULL, NULL) = 17
recvfrom(14, 0x55da6c4acbad, 16367, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getsockopt(13, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\1@\204\4\0@\234\0\0\250\5\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
ioctl(13, SIOCOUTQNSD, 0x55da6c88bb54)  = 0
write(13, "\27\3\3\2\23\254e\210\357\317\345P\210\244\306M\232\272\232\267\36g\177SL\2101\362\34\177\275\345"..., 536) = 536
epoll_wait(3, [{EPOLLIN, {u32=14, u64=14}}], 32, 956) = 1
recvfrom(14, "Content-Type: application/json\r\n"..., 16384, 0, NULL, NULL) = 129
recvfrom(14, 0x55da6c4acc1d, 16255, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=14, u64=14}}], 32, 1) = 1
recvfrom(14, "{\"close\":19173.07,\"currency_pair"..., 16384, 0, NULL, NULL) = 180
recvfrom(14, 0x55da6c4acc50, 16204, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=14, u64=14}}], 32, 1) = 1
recvfrom(14, "", 16384, 0, NULL, NULL)  = 0
close(14)                               = 0

This is a slow response ("GET /now/USD"):

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 14
setsockopt(14, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
connect(14, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLOUT, {u32=14, u64=14}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=14, u64=14}}], 32, 648) = 1
getsockopt(14, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(14, {sa_family=AF_INET, sin_port=htons(54504), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 0
getsockopt(13, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\1\340\223\4\0@\234\0\0\250\5\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
ioctl(13, SIOCOUTQNSD, 0x55da6c88bb54)  = 0
write(13, "\27\3\3\2\02300\332[k\35\226\311\33649f\276\310\23\3034\375q\23\27\351s\335\350Ot"..., 536) = 536
epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
epoll_wait(3, [{EPOLLIN, {u32=10, u64=10}}], 32, 644) = 1
read(10, "\27\3\3\2\23", 5)             = 5
read(10, "Ph\10Zx_|B\211z\350\303<_\300\201\340\353Uu\350\n\237.\310\366K\vQ9{+"..., 531) = 531
epoll_wait(3, [{EPOLLIN, {u32=13, u64=13}}], 32, 396) = 1
read(13, "\27\3\3\2\23", 5)             = 5
read(13, "\252\324\30\247\177\271\250\177Z\203uN\206\347\364\320\7@^\300\205\265\24[\376\2TN\2243\351\247"..., 531) = 531
epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN|EPOLLOUT, {u32=14, u64=14}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=14, u64=14}}], 32, 145) = 1
sendto(14, "GET /now/USD HTTP/1.1\r\nHost: h6z"..., 378, 0, NULL, 0) = 378
epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
epoll_wait(3, [], 32, 141)              = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [{EPOLLIN, {u32=10, u64=10}}], 32, 1000) = 1
read(10, "\27\3\3\2\23", 5)             = 5
read(10, "E\376L\241\24\213\256\365{\205}\10\365=k\223\271/^\25\275\212T\22\341cj_\205\377I\314"..., 531) = 531
epoll_wait(3, [], 32, 609)              = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [{EPOLLIN, {u32=13, u64=13}}], 32, 1000) = 1
read(13, "\27\3\3\2\23", 5)             = 5
read(13, "\0231\343\374\222l%\332\347\256\213\340\2\372\224\235aW\0360\n\241\347ky\35\354\372\366/H\r"..., 531) = 531
epoll_wait(3, [], 32, 369)              = 0
epoll_wait(3, [], 32, 1)                = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [{EPOLLIN, {u32=10, u64=10}}], 32, 1000) = 1
read(10, "\27\3\3\2\23", 5)             = 5
read(10, "\220\2619\343U\17\220\36\351g&(\353\0F\261\20s\307\215\362d\217\17\314`\31T#\315`\267"..., 531) = 531
epoll_wait(3, [], 32, 457)              = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 473)              = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, 
[{EPOLLIN, {u32=13, u64=13}}], 32, 525) = 1
read(13, "\27\3\3\2\23", 5)             = 5
read(13, "\371\3218\t\366z\370A\32x\0\263Q\275\222\"d\242\17\376\7@C.\267T\3273\200\362\3032"..., 531) = 531
epoll_wait(3, [], 32, 137)              = 0
epoll_wait(3, [], 32, 1)                = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [{EPOLLIN, {u32=13, u64=13}}], 32, 1000) = 1
read(13, "\27\3\3\2\23", 5)             = 5
read(13, "\22\342\33\36\254$k\312oOe\254\375(\fwU=\26b\221\363,\223`\35\344\3147\25pC"..., 531) = 531
epoll_wait(3, [], 32, 113)              = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [{EPOLLIN, {u32=10, u64=10}}], 32, 1000) = 1
read(10, "\27\3\3\2\23", 5)             = 5
read(10, "\375\224\0\262E\305\233a\323\255{\246\364\205>\231\\7\211\340\300M\252\265\200g\16\233\313\305d\344"..., 531) = 531
epoll_wait(3, [], 32, 53)               = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 134)              = 0
epoll_wait(3, [], 32, 345)              = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_ctl(3, EPOLL_CTL_MOD, 13, {EPOLLIN|EPOLLOUT, {u32=13, u64=13}}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=13, u64=13}}], 32, 516) = 1
write(13, "\27\3\3\2\23\212\26\211\31y?\333u\nZ\276\304|9\0G\207\324\2205\306U\32\23z\36\327"..., 536) = 536
epoll_ctl(3, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=13, u64=13}}) = 0
epoll_wait(3, [], 32, 516)              = 0
epoll_wait(3, [], 32, 1)                = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [], 32, 1000)             = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [], 32, 1)                = 0
epoll_wait(3, [{EPOLLIN, {u32=10, u64=10}}], 32, 1000) = 1
read(10, "\27\3\3\2\23", 5)             = 5
read(10, "^\320\314\231\345\327/\376v\16\34\225\366)/1?\373 %q\323\352\350\266\177S\375;\320\365w"..., 531) = 531
epoll_wait(3, [{EPOLLIN, {u32=13, u64=13}}], 32, 840) = 1
read(13, "\27\3\3\2\23", 5)             = 5
read(13, "\302\364\204\242\"\344Y=f\200\320\310.\227\336||U\6P\0305\377;\264c\372\352\277=S\317"..., 531) = 531
epoll_wait(3, [], 32, 480)              = 0
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
getpid()                                = 479
epoll_wait(3, [{EPOLLIN, {u32=14, u64=14}}], 32, 269) = 1
recvfrom(14, "HTTP/1.0 200 OK\r\n", 16384, 0, NULL, NULL) = 17
recvfrom(14, 0x55da6c4acbad, 16367, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getsockopt(13, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\1\340\223\4\0@\234\0\0\250\5\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
ioctl(13, SIOCOUTQNSD, 0x55da6c88bb54)  = 0
write(13, "\27\3\3\2\23s\240\305y\210\227\2244~\306\371\270/l\273\372\323ncz\34\365\1\261\16f\355"..., 536) = 536
epoll_wait(3, [{EPOLLIN, {u32=14, u64=14}}], 32, 157) = 1
recvfrom(14, "Content-Type: application/json\r\n"..., 16384, 0, NULL, NULL) = 495
recvfrom(14, "", 15889, 0, NULL, NULL)  = 0
close(14)                               = 0

I don't see a lot of difference, other than the slow one having a lot more polling between the request and the response (reflecting the extra time it takes). Maybe someone else can find a kernel of info here.

In some cases, the /var/lib/tor/state file got rebuilt during the slow response, but not the fast, but that didn't seem to happen in this example.

But my assessment is: machine load is fine; machine memory is fine; machine I/O is fine; spotbit runs quickly and immediately; but there seems to be a consistent lag on some queries between tor and spotbit on the local machine.

Oh, I note one difference in my examples is the conversation with file handle 10 that occurs in the slow but not fast response: that's not consistent. I've seen it in both types, so I assume it's something extraneous and unrelated.

¿Have you get response using now/usd sometime? I'm starting checking code and always I get this:

127.0.0.1 - - [01/Dec/2020 21:06:12] "GET /now/usd HTTP/1.1" 500 -
[2020-12-01 21:07:22,969] ERROR in app: Exception on /now/usd [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "server.py", line 167, in now_noex
    cursor = db_n.execute(ms_check)
sqlite3.OperationalError: database is locked

Seems related to #16

Yep, I'm still seeing /now/USD fine, but the lag seems to have grown even more. Maybe a minute now?

I should correct that.

The server seems to respond correctly maybe half of the time. The other half it generates an Internal Server Error which likely correlates with the databse timeout. This is not new: I saw it when I was running SpotBit tests last week too. The only difference is in the frequency. They were maybe coming up 1 time in 4 or 5 before, and now they're coming up 1 time in 3, maybe as common as 1 time in 2. (And through further testing I've gotten 3 failures in a row, so that's not good.)

Can anyone say anything about how /now/usd and/now/usd/coinbasepro are dealt differently internally? Does /now/usd use the DB and /now/usd/coinbasepro not?

To answer my own question, they both use the DB.

This is /now/usd:

        for exchange in components:
            ms_check = f"SELECT timestamp FROM {exchange} LIMIT 1;"
...
            statement = f"SELECT * FROM {exchange} WHERE pair = '{ticker}' AND timestamp > {ts_delta} ORDER BY timestamp DESC LIMIT 1;"

And this is /now/usd/coinbasepro:

        statement = f"SELECT * FROM {exchange} WHERE pair = '{ticker}' ORDER BY timestamp DESC LIMIT 1;"

More thoughts on the issue with the Database:

  1. Clearly, the database ever locking is a problem, and one that's only going to get worse with usage. It suggests the need for additional concurrency, for which I'd suggest MariaDB, possibly in a master-slave configuration, with the master used for writing and the slave used for reading (but I'd test first just with a plain MariaDB, because master-slave databases are an administrative nightmare due to the regular death on crashes).

  2. Debugging is made more difficult by the fact that as far as I know there's no way to live-watch SQLite3 processes, and see what's being locked and why; and the fact that the spotbit.log file hasn't recorded in a month and a half. Just going to MariaDB would allow a lot of improved monitoring.

  3. However, I have some doubts that it's the heart of the slowdown issue, as my test work on the database comes back very quickly, as witnessed by the following main lookups from /now/USD:

sqlite> select timestamp from okcoin limit 1;
1604428560000
sqlite> select * from 'okcoin' WHERE pair='BTC-USD' AND timestamp > 1606941610 limit 1;
4404799|1604428560000|2020-11-03 18:36:00|BTC-USD|13765.62|13768.01|13765.62|13768.01|0.0208

My general suggestions:

  1. Figure out what is different between /now/USD and /now/USD/coinbasepro, as the time differential between those is very consistent (100% so)
  2. Consider a long-term Issue for going to a database built around more concurrency: it's possible the lookups and writes from the exchanges are causing the Timeouts for reading from Spotbit, and this will only get worse. I'm not sure it'll fix the #1 problem of latency on most lookups, but I think it's a lurking problem.

And, I still don't have a good idea of the source of the latency other than [1] we can pinpoint it to the /now/USD command, but not /now/USD/coinbasepro; and [2] it looks to me like it happens after spotbit ends it work and before Tor picks it up and sends it back, but I'm not an expert on reading traces, so I might have missed a nuance there.

Figuring out the difference between /now/USD and/now/USD/coinbasepro is the way to go the next step there though ...

This should be fixed by the latest refactors to improve reliability.