storj-archived/storjshare-daemon

Daemon crash with OutOfMemory exception

Closed this issue · 27 comments

Last weekend Th3Van uploaded a huge amount of shards. A few farmers reported OutOfMemory exceptions. The daemon crashed and didn't restart the farmer. I ran different tests and here are my results:

Good news:
1.) No memory leak. Last weekend my farmer used 1GB of memory. Th3Van stopped his uploads and over time my farmer's memory usage decreased. In the end my farmer was running at 100MB-200MB again (let's call that usage normal). I was not able to find any leaks in GUI memory snapshots.
2.) Low memory usage for sending OFFER. I don't see any increase at all.
3.) High memory usage for locating renter (plus ~50MB-100MB). Memory back to normal a few minutes later. No big deal.

Bad news:
4.) High memory usage for mirror creation, shard uploads and downloads (plus ~100-200MB). The memory stays blocked even after the mirror creation has finished. It takes a few hours to get back to normal.
5.) Uneven shard distribution combined with no transfer limit will kill the preferred low-responseTime farmer. I have seen up to 20 concurrent OFFERs over the weekend. The preferred low-responseTime farmer will send many OFFERs at the same time; I would expect up to 100 OFFERs in a short time. offerBackoffLimit will only stop sending OFFERs, but at that point the preferred low-responseTime farmer is already trapped. He will be forced to open too many shard transfers at the same time, and if he hits the magic 1.5GB mark he will crash with an OutOfMemory exception.
6.) No way to increase the memory. See #113
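To illustrate point 5.): the backoff only gates new OFFER messages, while already-accepted transfers keep running and keep holding memory. A rough sketch of that logic (hypothetical names, not the actual storj-lib code):

```javascript
// Hypothetical sketch of the OFFER backoff described above.
// offerBackoffLimit only stops *new* OFFER messages; transfers that
// were already accepted keep running and keep holding memory.
function shouldSendOffer(state, config) {
  // Temporarily stop offering once too many shard transfers are active
  if (state.activeTransfers >= config.offerBackoffLimit) {
    return false;
  }
  // Cap how many OFFER messages may be pending at one time
  if (state.pendingOffers >= config.maxOfferConcurrency) {
    return false;
  }
  return true;
}

// Example: a fast farmer that already accepted 4 transfers stops
// offering, but those 4 transfers still occupy RAM until they finish.
const state = { activeTransfers: 4, pendingOffers: 0 };
const config = { offerBackoffLimit: 4, maxOfferConcurrency: 3 };
console.log(shouldSendOffer(state, config)); // false
```

So a fast farmer stops offering once it hits offerBackoffLimit, but by then it may already hold more concurrent transfers than its heap can survive.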

Unclear:
7.) I was not able to isolate mirror creation, shard upload and download from each other. It looks like only shard upload and mirror creation need a lot of memory; shard download should be fine. I will try to get some more samples to be sure.

Workaround:
As long as #113 is not implemented you can increase the memory here: https://github.com/Storj/storjshare-daemon/blob/cf51c9fb47ce74a144d8180ce16361ae694d441b/lib/api.js#L121-L122
Change it to:

        'node',
        ['--max_old_space_size=4096',
        path.join(__dirname, '../script/farmer.js'),
        '--config', configPath],

I have a memory problem since yesterday night ... check the screenshot
https://community.storj.io/file-upload/ZFA6LqcFgKzf82Fdh/Clipboard%20-%20July%202,%202017%202:12%20PM

I made a restart and changed maxConcurrency to 1, and I think this helped with the RAM issue.

But I still see high CPU usage from the Storj Share GUI (I'm running Windows Server 2008 R2 Enterprise SP1 with a Xeon CPU, 32 GB RAM and the latest GUI version).

Please help

I am also seeing massive memory usage, ~22 GiB!!

@Vanilla_Buddha on rocketchat

capture

Is this a record? :)

image

yup, looks like it :)

I've seen similar issues:

    storj 6503 22.5 15.9 14261420 10513464 ? Sl 10:30 36:56 /usr/bin/node /usr/lib/node_modules/storjshare-daemon/script/farmer.js

Processes over 10GB

I have the same issue, only without crashing, as I've managed a workaround with a really large swap size.
On CentOS 7, MaxOffer is set to 20, MaxConnection is set to 500, 15x2TB nodes:

image

On Windows 2016 Standard, MaxOffer is set to 8, MaxConnection is set to 300, 7x1TB nodes:

During the night > image

During the day > image

Lowering those 2 values helps a great deal with RAM economy. Second thing to mention is that RAM usage spikes to almost double during rush hours. A quick fix is restarting the nodes.

Admins from Storj chat asked that I post my RAM usage here as well. The screenshot shows 5GB usage, but it has been as high as 7GB. I guess I wouldn't mind this during these early stages while Storj figures things out if I had 32GB of RAM on my server, but this is almost using up the remainder of my RAM. And as I said in chat - no one will want to use this software if it is eating up that much RAM. Even 1GB is kind of a lot (my full SQL Server instance, a full database server, hovers around 1GB).

3 things off the top of my head:

  1. Some kind of SYNC feature, where upon upload the data does not need to be immediately pushed to the 5 other mirrors (how about pushing to 1 mirror initially?). Push the mirror data out to the remaining nodes at a later time, when the app is using fewer resources (or let the user schedule mirror update windows, like 1am - 5am or something). An admin mentioned that the longer they wait, the more likely the shards are to get lost - I imagine that means if a node goes down before it was mirrored. Maybe it just doesn't need to go out to all 5 at once.

  2. In a Windows EXE program I created once, I limited the amount of RAM the program was allowed to use/consume at any given moment. This might make the program run slower, downloads and uploads slower - and we want Storj to be FAST. BUT also allow the user to set this value from the GUI or command line, so if someone does have 32GB+ RAM, they can set it to a higher value.

  3. Figure out how FTP transfers data with so little system resource usage. I checked my FileZilla client when uploading to my server, and it barely uses any RAM. How is the data in Storj pushed out to nodes? What protocol?

ram_usage

Also, could an admin post a quick summary here of the "workarounds" they suggest for now? There was some talk about setting maxShardSize - maybe some additional permanent info here on the pros and cons of doing that. And someone mentioned maxConcurrency above; I imagine that limits the number of upload/download connections at a given time. I imagine using either of these will limit our nodes' performance? What other settings are missing?

Also, could an admin post a quick summary here of "workarounds" they suggest for now?

Set these options to their default values:

  // Maximum number of concurrent connections to allow
  "maxConnections": 150,

  // Limits the number of pending OFFER message at one time
  "maxOfferConcurrency": 3,

  // Temporarily stop sending OFFER messages if more than this number of shard 
  // transfers are active
  "offerBackoffLimit": 4,

This also might be linked to this issue: #247

My first test was 5 months ago. I will repeat the test with the current version.

Still the same result. I don't see any memory leaks, but shard transfers in any direction need a lot of memory.

At the moment at least one renter is uploading 100MB shards. I have a fast 50MBit connection and can finish each transfer in less than 2 minutes. Every 20 minutes I get one shard. How about a low-responseTime farmer with only 2MBit or less? I would expect an out of memory exception in 3-4 hours because of too many open transfers.

My most powerful node, my web server with a static IP address, used to hit upwards of 5GB RAM usage. This was 1-2 months ago. Storj may have made improvements, or may just be testing the network less... I was about to post that this server has not gone above 1.25GB of RAM in the last month, but I looked right now and it is at almost 2.5... From what I can tell they may have adjusted based on available RAM in the system, as my other local PC with only 5GB of RAM has been hovering around 0.5GB of RAM usage (the server has 16GB RAM)... They should really post if they have been making updates to usage, or if this is all just random... Either way, again, my entire SQL Server instance running DBs for 20 clients never goes above 1GB usage; this number needs to get lower in order to be an epic blockchain project (side thread coming soon... Storj, we need to talk soon... Storj is not actually decentralized and has to start using farmers as bridges, instead of their 1 bridge that can fail at any given time, taking the entire system down - I am invested and rooting and confident in Storj). Later.
untitled

@braydonf I have an idea how we can reproduce it on staging.
1.) Limit the download traffic of one farmer to a very low value, let's say 1KByte/s. Upload traffic unlimited. All other farmers unlimited.
2.) Start uploading shards. 100MB - 8GB shards like on the main network.
3.) Wait for a few mirror creations. The farmer will start downloading the mirrors very slowly.
4.) Keep sending OFFERs to get more and more mirrors without finishing the transfers. Memory usage should increase.

To avoid this problem, reducing the number of farmers running on the same slow connection should help.

I did exactly the steps littleskunk mentioned above with the following stats:

2Mbit Download
350Mbit Upload
Node Size 8TB via config
4 Nodes on the System, 16GB RAM, only one limited (500/500Mbit Connection)
Windows Server 2012 R2 Datacenter

I got a lot of shards on the node and the node kept increasing in memory usage.
I stopped the node after it had 11.4GB RAM usage.

Unfortunately I didn't screenshot the Task Manager, but it was exactly what littleskunk described.
The upload wasn't the problem, with usage of about 16%; the download was pinned to 100%.
I will recreate the scenario right now and attach pictures.

Thank you @stefanbenten

That explains a good workaround at the same time: limiting the farmer's upload traffic should reduce the memory usage. A lower offerBackoffLimit should help as well.

Here is an example I just took, with the recent stress test going on.

I limited the top two nodes to 2Mbit/s download each and set no limit for upload, which is max. 350Mbit/s.
ramusage
As one can clearly see, the other two nodes, which are not limited, don't have that RAM usage problem.

I actually got a daemon crash when my system ran out of memory, even without any network usage. I have no idea what happened. Here:

Memory

Network

Strange Disk Activity

From reports on the last release, memory usage is down from storj-archived/kfs#59

Closing for now, please reopen if still an issue. Even though this is closed, we should still aim to make considerable improvements to memory usage.

Please reopen. I am sure it is still an issue because the KFS fix only reduces the memory for one single transfer, but the reason for the out of memory exception is too many parallel transfers, and that is still possible.

It seems Storj Share is consuming a lot of memory without freeing it up.
Last night I had to do a force reboot because it froze. I tried not to open other heavy-usage apps, then went to bed... This morning, my PC was frozen completely.

In Task Manager, I see a lot of network traffic and Storj disk I/O, but Storj's memory seems normal. I believe the memory issue is not from Chrome.

image

image

Windows 10 Pro v1709 build 16299.19

What I observed with my nodes is that Storj gains memory utilization over time. Network utilization doesn't always correlate with the RAM it's consuming. At the 5-day mark I see nodes as high as 2GB/node, no matter the network utilization.

High speed, low RAM allocation. (1st day uptime during stress)
storj1-2

High speed, high RAM allocation. (3rd day uptime during stress)
storj3-2

End of stress tests (5th day of uptime)
storj5

An event then happens that reduces the Storj node memory utilization (if the kernel does not crash first from running out of RAM). Each Storj node is back down to under 400MB/node with 5 days of uptime. However, the RAM didn't get released from cache.

End of stress test - day 5 uptime - reduces Storj's direct RAM allocation and marks it as cached.
storj5reduced

There is 13.5GB of modified RAM cached (Windows describes this as memory whose contents must be written to disk before it can be used). If left alone, Storj will ramp RAM utilization back up over time, exceed available RAM, and crash due to excessive RAM stuck in cache.
Note that there is very minimal network use from Storj at this time, < 0.1Mbps, for hours at a time.
storjcachedexceeded

I understand this is bad for reputation, but my only options to avoid a crash are stopping/starting the nodes, rebooting, or closing and reopening Storj Share. Or patiently waiting for a crash - and having a script clean it up, or rebooting when my monitor goes off. After one of those tasks, it finally releases the cache.

Stopping/starting nodes during no traffic, 2 down - 4 up (you can see the cached RAM becoming free again, 2 steps down since 2 nodes are down)
freecache

Here's how the RAM looks after the nodes have come back up fresh. Restarting the nodes released 17GB of RAM.
storjcachedexceeded3

Stress Test End
stress

I believe this issue is still haunting me. At the moment, I have to reboot my PC after a few days of monitoring Storj Share's hidden RAM usage before playing a game (this memory issue freezes my game entirely).

GUI 7.3.4
Daemon 5.3.1
Core 8.7.2
Protocol 1.2.0

EDIT: Added Screenshots to my post above.
EDIT2: I've confirmed that the scenario above did indeed take place at the end of the stress test.

Here's another theory to chew on. My RAM cache (hidden RAM) was stuck very high at about the same time that this last stress test ended. Each node had been steadily sending and receiving traffic up until the testing stopped.

In a real-world test, the user uploading data would care whether the job completed or not, but with test data, who cares. The test script would just end abruptly in the middle of uploading chunks. That data then gets hung up in the node's RAM cache, sitting and waiting for an upload completion that never comes, so it never gets released from RAM.

👋 Hey! Thanks for this contribution. Apologies for the delay in responding!

We've decided to rearchitect Storj, so that we can scale better. You can read more about this decision here. This means that we are entirely focused on v3 at the moment, in the storj/storj repository. Our white paper for v3 is coming very, very soon - follow along on the blog and in our Rocketchat.

As this repository is part of the v2 network, we're no longer maintaining this repository. I am going to close this for now. If you have any questions, I encourage you to jump on Rocketchat and ask them there. Thanks!