Too many open files
$ storjshare --version
StorjShare: 7.0.3
Core: 4.1.0
Protocol: 0.9.1
kfs@2.2.4
I couldn't find an existing issue for this bug, so I'm posting it here; I'm not sure whether it is already fixed in 2.2.5.
At the moment, running lsof shows 23270 open files. It looks like none of the leveldb files are ever closed.
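For anyone who wants to watch the count from inside the process instead of via lsof, here is a minimal diagnostic sketch of my own (Linux only, not part of storjshare; the filename filter is just illustrative):

```js
// Minimal diagnostic sketch (Linux only, not part of storjshare): count this
// process's open file descriptors and how many of them point at LevelDB files.
const fs = require('fs');

const fds = fs.readdirSync('/proc/self/fd');
let leveldbFds = 0;
for (const fd of fds) {
  try {
    // Each entry in /proc/self/fd is a symlink to the underlying file/socket/pipe.
    const target = fs.readlinkSync('/proc/self/fd/' + fd);
    if (/\.(ldb|log)$|LOCK$|LOG$|MANIFEST/.test(target)) leveldbFds++;
  } catch (e) {
    // The fd may have been closed between readdir and readlink; ignore it.
  }
}
console.log('open descriptors:', fds.length, 'leveldb-related:', leveldbFds);
```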
{"level":"debug","message":"contact already in bucket, moving to tail","timestamp":"2016-11-05T15:00:16.026Z"}
{"level":"info","message":"handling storage consignment request from a69d4f2e3e97e4f26ece3d437b2825c5d7fed9e6","timestamp":"2016-11-05T15:00:16.060Z"}
{"level":"info","message":"authorizing data channel for a69d4f2e3e97e4f26ece3d437b2825c5d7fed9e6","timestamp":"2016-11-05T15:00:16.117Z"}
{"level":"info","message":"replying to message to 13af651c84a74787103c961ada33a9d057a98c05","timestamp":"2016-11-05T15:00:16.117Z"}
{"level":"debug","message":"not waiting on callback for message 13af651c84a74787103c961ada33a9d057a98c05","timestamp":"2016-11-05T15:00:16.133Z"}
{"level":"info","message":"data channel connection opened","timestamp":"2016-11-05T15:00:16.451Z"}
events.js:160
throw er; // Unhandled 'error' event
^
Error: IO error: /mnt/md3/storj/sharddata.kfs/196.s/002392.ldb: Too many open files
at Error (native)
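For context on the crash itself (my illustration, not storjshare code): Node terminates exactly like the trace above whenever an EventEmitter emits 'error' with no listener attached, so the leveldb IO error goes straight to a process crash.

```js
// Background sketch: an EventEmitter 'error' with no listener is rethrown,
// producing the "Unhandled 'error' event" crash shown in the trace above.
const EventEmitter = require('events');

const em = new EventEmitter();
em.emit('error', new Error('IO error: ...: Too many open files')); // throws here
```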
Agreed, though the real problem isn't scandir; it's kfs opening too many leveldb files and never closing them. scandir is simply complaining when it can't open a file because of the kfs problem. My node has crashed every single night with this error since the stress testing started.
Could this be a side effect of storj-archived/core#566? My node is authorizing data channels, including the expected shard hash. I don't know at which point kfs opens the file; it is only a guess.
I don't think they are related. This error is a file descriptor leak that is hitting the hard limits set by the OS. If you run lsof, you can clearly see that none of the leveldb files are ever closed.
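To see how close the process is to that limit, lsof -p <pid> | wc -l gives the usage side, and something like this hedged sketch (Linux only, not part of storjshare) prints the ceiling itself:

```js
// Diagnostic sketch (Linux only): print the soft/hard "Max open files" limit
// this process runs under, i.e. the ceiling the descriptor leak keeps hitting.
const fs = require('fs');

const limitLine = fs.readFileSync('/proc/self/limits', 'utf8')
  .split('\n')
  .find((line) => line.startsWith('Max open files'));
console.log(limitLine); // e.g. "Max open files    1024    4096    files"
```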
StorjDrive-scott1-_2017-2-7.txt
{info} [Tue Feb 07 2017 00:02:35 GMT-0800 (PST)] received valid message from {"userAgent":"6.0.15","protocol":"1.1.0","hdKey":"xpub6AHweYHAxk1EhJSBctQD1nLWPog6Sy2eTpKQLExR1hfzTyyZQWvU4EYNXv1NJN7GpLYXnDLt4PzN874g6zSjAQdFCHZN7U7nbYKYVDUzD42","hdIndex":7839318,"address":"104.196.66.41","port":8470,"nodeID":"b1d711fc81594217351a993c29d1655b9133e3f1","lastSeen":1486454555861}
{info} [Tue Feb 07 2017 00:02:35 GMT-0800 (PST)] sending PUBLISH message to {"userAgent":"6.0.15","protocol":"1.1.0","address":"174.126.190.172","port":40228,"nodeID":"b0c7887e821132c99a2037fe0f25651f77d94c75","lastSeen":1486454555864}
{info} [Tue Feb 07 2017 00:02:35 GMT-0800 (PST)] sending PUBLISH message to {"userAgent":"6.1.2","protocol":"1.1.0","address":"client013.storj.dk","port":15076,"nodeID":"b17e436e0317a24f8ffdd35463993ff93949bd10","lastSeen":1486454555889}
{info} [Tue Feb 07 2017 00:02:35 GMT-0800 (PST)] replying to message to f892986a61e4d8f79e61c313d69c6346257902c7
{info} [Tue Feb 07 2017 00:02:36 GMT-0800 (PST)] received valid message from {"userAgent":"6.0.15","protocol":"1.1.0","address":"174.126.190.172","port":40228,"nodeID":"b0c7887e821132c99a2037fe0f25651f77d94c75","lastSeen":1486454556123}
{info} [Tue Feb 07 2017 00:02:36 GMT-0800 (PST)] received valid message from {"userAgent":"6.1.2","protocol":"1.1.0","address":"client013.storj.dk","port":15076,"nodeID":"b17e436e0317a24f8ffdd35463993ff93949bd10","lastSeen":1486454556342}
{info} [Tue Feb 07 2017 00:02:38 GMT-0800 (PST)] received valid message from {"userAgent":"6.1.2","protocol":"1.1.0","address":"139.60.66.116","port":38164,"nodeID":"b7c4e32a0b119b430fa4d60cadb5c56476cd20ca","lastSeen":1486454558764}
{info} [Tue Feb 07 2017 00:02:38 GMT-0800 (PST)] sending PING message to {"userAgent":"6.1.2","protocol":"1.1.0","address":"client016.storj.dk","port":15035,"nodeID":"b740eccb3402a4e60738fe53453ade56fbde7fd0","lastSeen":1486454558765}
{info} [Tue Feb 07 2017 00:02:38 GMT-0800 (PST)] received PING from b7c4e32a0b119b430fa4d60cadb5c56476cd20ca, sending PONG
{info} [Tue Feb 07 2017 00:02:38 GMT-0800 (PST)] replying to message to d00bd875a401d3a1fe518b15c85982d2319b60d7
{info} [Tue Feb 07 2017 00:02:39 GMT-0800 (PST)] received valid message from {"userAgent":"6.1.2","protocol":"1.1.0","address":"client016.storj.dk","port":15035,"nodeID":"b740eccb3402a4e60738fe53453ade56fbde7fd0","lastSeen":1486454559221}
{info} [Tue Feb 07 2017 00:02:42 GMT-0800 (PST)] received valid message from {"userAgent":"6.1.1","protocol":"1.1.0","address":"stx.myonlineportal.net","port":4005,"nodeID":"2c01a2fa2b2b5950bdec9fb79f29b691b2c7fd48","lastSeen":1486454562800}
{info} [Tue Feb 07 2017 00:02:42 GMT-0800 (PST)] received PING from 2c01a2fa2b2b5950bdec9fb79f29b691b2c7fd48, sending PONG
{info} [Tue Feb 07 2017 00:02:42 GMT-0800 (PST)] replying to message to 48741275cd76b04651e990378d84e3a7b326602f
{info} [Tue Feb 07 2017 00:02:42 GMT-0800 (PST)] received valid message from {"userAgent":"6.0.11","protocol":"1.0.0","address":"smthgroup.chph.ras.ru","port":36214,"nodeID":"b105681d008b4616993ff354d7bc12613313dc59","lastSeen":1486454562962}
{info} [Tue Feb 07 2017 00:02:42 GMT-0800 (PST)] sending PUBLISH message to {"userAgent":"6.1.2","protocol":"1.1.0","address":"188.187.160.127","port":43318,"nodeID":"b1d9753ff569497622d56f932edf81893590d663","lastSeen":1486454562967}
{info} [Tue Feb 07 2017 00:02:43 GMT-0800 (PST)] replying to message to 3fcdda92f6895d8330e9a43fa2d0e51add32581e
{error} [Tue Feb 07 2017 00:02:43 GMT-0800 (PST)] IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/contracts.db/000146.ldb: Too many open files
{info} [Tue Feb 07 2017 00:02:47 GMT-0800 (PST)] received valid message from {"userAgent":"6.0.15","protocol":"1.1.0","hdKey":"xpub6AHweYHAxk1EhJSBctQD1nLWPog6Sy2eTpKQLExR1hfzTyyZQWvU4EYNXv1NJN7GpLYXnDLt4PzN874g6zSjAQdFCHZN7U7nbYKYVDUzD42","hdIndex":951546,"address":"104.196.223.54","port":8462,"nodeID":"2913081bc65b63d4d302f001bd5126319c2be83c","lastSeen":1486454567842}
{info} [Tue Feb 07 2017 00:02:47 GMT-0800 (PST)] received PING from 2913081bc65b63d4d302f001bd5126319c2be83c, sending PONG
{info} [Tue Feb 07 2017 00:02:47 GMT-0800 (PST)] replying to message to aae60735927bc247864e13411f210ca3866d0d4c
{info} [Tue Feb 07 2017 00:02:49 GMT-0800 (PST)] received valid message from {"userAgent":"6.1.2","protocol":"1.1.0","address":"client023.storj.dk","port":15092,"nodeID":"b3a602428c673ad3d6686aa3cec00e3947b70643","lastSeen":1486454569811}
{info} [Tue Feb 07 2017 00:02:49 GMT-0800 (PST)] sending PUBLISH message to {"userAgent":"6.1.2","protocol":"1.1.0","address":"client018.storj.dk","port":15083,"nodeID":"b16b90e67a975b9dc444fc9203294b7a5cd23181","lastSeen":1486454569814}
{info} [Tue Feb 07 2017 00:02:49 GMT-0800 (PST)] sending PUBLISH message to {"userAgent":"6.1.1","protocol":"1.1.0","address":"client023.storj.dk","port":15059,"nodeID":"b12b2fc9b43a1d1309db56650e8528b12a62dd8d","lastSeen":1486454569839}
I'm having this error on the 4.0.10 GUI client:
{error} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] Could not get usedSpace: IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/sharddata.kfs/251.s: Too many open files
{debug} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] not sending an offer for the contract
{error} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] Could not get usedSpace: IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/sharddata.kfs/251.s: Too many open files
{debug} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] not sending an offer for the contract
{error} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] Could not get usedSpace: IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/sharddata.kfs/251.s: Too many open files
{debug} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] not sending an offer for the contract
{error} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] Could not get usedSpace: IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/sharddata.kfs/251.s: Too many open files
{debug} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] not sending an offer for the contract
{error} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] Could not get usedSpace: IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/sharddata.kfs/251.s: Too many open files
{debug} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] not sending an offer for the contract
{error} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] Could not get usedSpace: IO error: /home/anonymous/Documents/Storj1-10gb/storjshare-ce5fe1/sharddata.kfs/251.s: Too many open files
{debug} [Sat Feb 18 2017 10:19:57 GMT-0800 (PST)] not sending an offer for the contract
I found the reason for this. kfs contains 257 leveldbs, and each leveldb keeps 4 files open: LOCK, LOG, MANIFEST, and one *.log file. So even with zero traffic we start out with 257 × 4 = 1028 open files, and these files are never closed.
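To illustrate where that baseline comes from, here is a sketch (assuming the leveldown module that kfs builds on is installed; the path is made up) that opens one fresh leveldb the way a single kfs S-bucket would and lists the files it creates:

```js
// Sketch: one S-bucket == one LevelDB. A freshly opened DB creates and holds
// LOCK, LOG, MANIFEST-* and a *.log write-ahead file, so 257 buckets held open
// at once account for roughly 257 * 4 = 1028 descriptors before any traffic.
const fs = require('fs');
const leveldown = require('leveldown'); // the same backing store kfs uses

const location = '/tmp/sbucket-example'; // illustrative path
const db = leveldown(location);

db.open((err) => {
  if (err) throw err;
  console.log(fs.readdirSync(location));
  // => [ '000003.log', 'CURRENT', 'LOCK', 'LOG', 'MANIFEST-000002' ]
  db.close(() => console.log('descriptors released'));
});
```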
The next problem is the p2p communication. storj-lib@v6.0.4 had a connection limit of 150. One release later it was unlimited, and because of the kad-quasar issues we had more connections over the last few days. The kad-quasar issues are fixed, but it is still possible to end up with too many open connections -> too many open files.
Sending an OFFER does not increase the open file counter.
I need a break; I will continue later. I would like to see whether shard uploads and downloads increase the file counter, and whether there is a way to disable the leveldb LOG file. That would free up 257 file descriptors.
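For what it's worth, the only related knob I know of in the Node binding is leveldown's maxOpenFiles option, and it only caps the *.ldb table handles per DB, not the LOCK/LOG/MANIFEST/*.log set, so the real lever is closing idle buckets. A hedged sketch (the path and the limit value are arbitrary):

```js
// Sketch: maxOpenFiles limits how many table (*.ldb) files one DB keeps open,
// but the per-DB LOCK/LOG/MANIFEST/*.log handles remain until close() is called.
const leveldown = require('leveldown');

const bucket = leveldown('/tmp/sbucket-example'); // illustrative path
bucket.open({ maxOpenFiles: 64 }, (err) => {      // 64 is an arbitrary cap
  if (err) throw err;
  // ... serve reads/writes for this S-bucket ...
  bucket.close((closeErr) => {
    if (closeErr) throw closeErr;
    // All baseline descriptors for this bucket are now released.
  });
});
```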
I will place my first bet on storjshare status printing out the used space; the GUI does the same. I will disable that feature and set concurrent offers to 0. Let's see whether all 257 leveldbs end up open again.
I have seen a few shard uploads and downloads in my log file. I still have 1028 open files + 111 connections, so that part is working as expected; I don't see any problems there.
I won the bet. I disabled the used space query here: https://github.com/littleskunk/core/blob/debug/lib/storage/adapters/embedded.js#L171-L173
Now I have only 4 open files plus the connections. The other 1024 files are closed.