Unity-Technologies/unity-cache-server

JavaScript heap out of memory

SciPlayD opened this issue · 51 comments

This is new in v6.2.3, I am pretty certain.

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0x100063a65 node::Abort() [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 2: 0x100064104 node::errors::TryCatchScope::~TryCatchScope() [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 3: 0x10019d9a7 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 4: 0x10019d944 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 5: 0x1005a2122 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 6: 0x1005a4653 v8::internal::Heap::CheckIneffectiveMarkCompact(unsigned long, double) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 7: 0x1005a0b88 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 8: 0x10059ed45 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
 9: 0x1005a7a7b v8::internal::Heap::FinalizeIncrementalMarkingIfComplete(v8::internal::GarbageCollectionReason) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
10: 0x1005b8e6f v8::internal::IncrementalMarkingJob::Task::RunInternal() [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
11: 0x1000ba156 node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
12: 0x1000b987a node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
13: 0x1009fbd2f uv__async_io [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
14: 0x100a0b898 uv__io_poll [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
15: 0x1009fc1b6 uv_run [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
16: 0x100048371 node::Start(v8::Isolate*, node::IsolateData*, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
17: 0x10004784c node::Start(uv_loop_s*, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
18: 0x1000474b2 node::Start(int, char**) [/Users/cache/node-v11.6.0-darwin-x64/bin/node]
19: 0x100001034 start [/Users/cache/node-v11.6.0-darwin-x64/bin/node]

What stands out immediately to me is that you're running node v11.6 .. we're still on v8.12 LTS.

I'll downgrade tonight. It wasn't clear whether v8.x was the required version, or the minimum required version.

Required. I wanted to move to v10 (latest LTS) but there are some incompatibilities that require investigation and fixing.

We haven't seen an OOM error yet, but we have noticed a massive increase in RAM consumption with 6.2.3. We're using the cache_fs module and are seeing each of the 8 node workers we're running consume up to 1.5GB of RAM. I'm under the impression that workers were barely using any RAM on 6.2.0.

We're running node 10.13.

@mleenhardt we don't support node v10 yet. Not saying that would make a difference in RAM usage but please use the version specified in the package.json file (or .nvmrc)
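For anyone unsure whether they're on a supported version: a minimal sketch of a startup guard that refuses to run on the wrong Node version (this is my own illustration, not the project's actual code; it assumes the semver package is installed and that the required range is ~8.12.0, which in the real project would come from the "engines" field of package.json):

// Illustrative startup guard (not the project's actual code).
const semver = require('semver');

const requiredRange = '~8.12.0'; // assumed; see package.json "engines" / .nvmrc
if (!semver.satisfies(process.version, requiredRange)) {
  console.error(`Node ${process.version} is not supported; please use ${requiredRange}`);
  process.exit(1);
}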

Published a fix for the worker memory leak in v6.2.4-beta2

Fixed in v6.2.4 (published to NPM)

We upgraded from 6.1.2 to 6.2.4 and we're seeing a huge increase in memory use. We have four workers, and after a day they are each using between 1.5 and 3.5GB of physical memory. This is on node 8.14 on linux.

I'll investigate further, thanks for reporting.

@corngood I'm having no luck reproducing growing memory allocations. I'm using the playback tools in the feature/diagnostic-tools branch to continuously send and receive data (a mix of GET and PUT commands) to a server with multiple workers.

I'm launching the server like this: node --inspect main -w 2, which allows the Chrome memory profiler to attach at chrome://inspect. From there, the "Allocation instrumentation on timeline" tool shows allocs/deallocs in real time.

I'm not seeing any leaking memory using this method. I'm testing on macOS, though I don't believe the platform should make a difference.
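
In case it helps others compare numbers at the process level, here's a minimal sketch (not part of the cache server) that logs each worker's RSS and heap once a minute using the standard process.memoryUsage() API; the interval and formatting are arbitrary:

const cluster = require('cluster');

// Inside each worker: periodically log memory so process-level growth shows up
// in the server log. unref() keeps the timer from holding the process open.
if (cluster.isWorker) {
  setInterval(() => {
    const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
    const mb = (n) => (n / 1024 / 1024).toFixed(1);
    console.log(`[worker ${cluster.worker.id}] rss=${mb(rss)}MB heap=${mb(heapUsed)}/${mb(heapTotal)}MB external=${mb(external)}MB`);
  }, 60 * 1000).unref();
}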

Here's the log for that run (until I killed it when it was using ~8GiB) in case it's useful: https://clbin.com/IsNpN

How much memory use are you seeing at the process level? 6.1.2 has been running since I restarted it this morning and it's using about 100MiB (physical) / worker.

I saw a high watermark of about 190M, settling back to about 150M at idle. I'm going to let this run for a while and keep pounding on it to see if it's a slow leak.

Thanks for the log - that crash in there should be fixed as well.

I also see Cannot read property 'clientAddress' of null in your log - will look into that as well.

@SGDavid what are you seeing in terms of memory consumption in 6.2.4?

Looks like ~250MB for an active process and ~95MB for an idle process. Node.js v8.12.0 on a Mac Mini

That seems normal. Maybe a platform specific issue after all.

The cache server I assessed (I have four in three different locations) can experience very heavy load at times. It has not had a single issue since the 6.2.4 release.

I did some profiling today against a Linux server (using Docker) and I'm seeing similar memory usage to MacOS. Not really sure where to go from here in terms of troubleshooting, unless someone else can chime in for some more evidence.

@stephen-palmer thanks for investigating. I'll try again and see if I can reproduce it with the memory profiler.

In case you're interested, here's what I was seeing:

[image: memory use graph]

The big spike with all the swap was when I was testing 6.2.4 on node 10. The smaller spike on the 14/15th was 6.2.4 on node 8.14. Since the 15th I've been using 6.1.2 on node 8.14, which seems to be ok with memory, but suffers from #68.

I had another try today. Same version of node (8.14), with 6.2.4. This is from top after about 6 hours.

24525 unityca+  20   0 5219092 3.833g   5216 S   0.0 24.6   4:27.55 node
24532 unityca+  20   0 2198016 1.073g   4940 S   0.0  6.9   2:41.22 node
24526 unityca+  20   0 3160652 0.979g   5036 S   0.0  6.3   2:35.36 node
24531 unityca+  20   0  983732  72036   4820 S   0.0  0.4   0:07.33 node
24514 unityca+  20   0  981948  69876   4912 S   0.0  0.4   0:56.87 unity-cache-ser

The server's starting to swap, so I'm going to have to revert for now. I haven't had time to do any profiling, unfortunately.

Thanks David. Obviously something is going on. Can you post your config dump from the command line (using 6.2.4, since it now aggregates defaults and shows a complete config), or send it directly to me if you don't want it public?

Sure, here's my config.yml:

Cache:
  defaultModule: cache_fs
  options:
    processor:
      putWhitelist: []
    cache_ram:
      cachePath: .cache_ram
      pageSize: 100000000
      maxPageCount: 10
      minFreeBlockSize: 1024
      persistence: true
      persistenceOptions:
        autosave: true
        autosaveInterval: 10000
        throttledSaves: false
    cache_fs:
      cachePath: "/mnt/fast/unitycache/cache5.0"
      cleanupOptions:
        expireTimeSpan: P30D
        maxCacheSize: 180000000000
Mirror:
  options:
    queueProcessDelay: 2000
    connectionIdleTimeout: 10000

and the output of --dump-config:

Cache:
  defaultModule: cache_fs
  options:
    processor:
      putWhitelist: []
    cache_ram:
      cachePath: .cache_ram
      pageSize: 100000000
      maxPageCount: 10
      minFreeBlockSize: 1024
      persistence: true
      persistenceOptions:
        autosave: true
        autosaveInterval: 10000
        throttledSaves: false
      highReliability: false
      highReliabilityOptions:
        reliabilityThreshold: 2
        saveUnreliableVersionArtifacts: true
        multiClient: false
    cache_fs:
      cachePath: .cache_fs
      cleanupOptions:
        expireTimeSpan: P30D
        maxCacheSize: 0
      persistence: true
      persistenceOptions:
        autosave: true
        autosaveInterval: 10000
        throttledSaves: false
      highReliability: false
      highReliabilityOptions:
        reliabilityThreshold: 2
        saveUnreliableVersionArtifacts: true
        multiClient: false
Mirror:
  addresses: []
  options:
    queueProcessDelay: 2000
    connectionIdleTimeout: 10000
Server:
  port: 8126
  options:
    allowIpv6: false
Global:
  logLevel: 3

I went back and scoured the diff between 6.1.2 and 6.2.4, and found some potential global memory leaks in and around the same area as the previous fix. I've cleaned that up and published v6.2.5-beta0. Install it with npm install -g unity-cache-server@beta, or get the branch https://github.com/Unity-Technologies/unity-cache-server/tree/release/v6.2.5

@stephen-palmer Thanks. I've been running that for about 5 hours, and I'm at:

22556 unityca+  20   0 5048152 3.709g   7692 S   0.0 23.8   3:59.00 node
22545 unityca+  20   0 1082784  90552   7752 S   0.0  0.6   0:16.23 node
22551 unityca+  20   0 1025156  79484   7768 S   0.0  0.5   0:13.81 node
22546 unityca+  20   0  964708  58764   7688 S   0.0  0.4   0:11.57 node
22534 unityca+  20   0  963592  56400   7740 S   0.0  0.3   0:24.62 unity-cache-ser

So there's still one process using a lot more memory than 6.1.2, but it doesn't seem to be increasing, and the others are much lower than 6.2.4. I'll keep it running and see if it remains stable.

the one with high usage is definitely odd. still holding steady?

Pretty much. The other workers have gone up a bit:

22556 unityca+  20   0 5045592 3.801g   6828 S   0.0 24.4   7:23.88 node
22546 unityca+  20   0 1158136 220640   6868 S   0.0  1.4   0:30.90 node
22545 unityca+  20   0 1080224 142348   6876 S   0.0  0.9   0:29.10 node
22551 unityca+  20   0 1037512 123872   6932 S   0.0  0.8   0:21.65 node
22534 unityca+  20   0  989492  83740   6872 S   0.0  0.5   0:47.28 unity-cache-ser

The one using lots of memory seems to get to that state fairly quickly and stay there (will confirm next time I restart it). I'll continue to monitor it.

Is there something it does that's proportional to the number of assets in the cache? Our cache is ~180GiB and has 1.4M files. How does that compare to your test scenarios?

It's not proportional to cache size, but possibly proportional to current library size, i.e. the number of files that are accessed in a full sync of your project. Still, I would expect all workers to eventually normalize based on random(ish) client distribution.

I just started a large refresh on my machine ("Importing through CacheServer") after switching Unity versions, and while it's in progress I can see one worker with high CPU usage whose physical memory use has now crept up to 1.4g. That's in addition to the one using 3.6g.

We have about 75k assets, assuming find Library/metadata -type f -name \*.info | wc -l is a good estimate.

I just restarted the server, and initially it uses around 40m of physical memory per worker. I'm trying to see where it jumps way up. Here's a recent log sample:

 in 0.023 seconds (1.43 MB/sec)
Jan 25 12:09:37 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:1] Sent 26 of 26 requested files (26 chunks) totaling 153.54 KB in 0.008 seconds (18.74 MB/sec)
Jan 25 12:10:35 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:3] Sent 24 of 24 requested files (24 chunks) totaling 57.76 KB in 0.02 seconds (2.82 MB/sec)
Jan 25 12:12:07 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:2] Sent 268 of 268 requested files (268 chunks) totaling 787.62 KB in 0.16 seconds (4.81 MB/sec)
Jan 25 12:12:23 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:2] Sent 137600 of 137600 requested files (371370 chunks) totaling 16.67 GB in 295.841 seconds (57.69 MB/sec)
Jan 25 12:13:09 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:3] Sent 26 of 26 requested files (26 chunks) totaling 153.54 KB in 0.02 seconds (7.5 MB/sec)
Jan 25 12:14:47 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:3] Sent 8 of 8 requested files (8 chunks) totaling 57.13 KB in 0.003 seconds (18.6 MB/sec)
Jan 25 12:18:38 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:3] Sent 244 of 244 requested files (244 chunks) totaling 584.26 KB in 0.141 seconds (4.05 MB/sec)
Jan 25 12:18:48 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:1] Sent 236 of 236 requested files (236 chunks) totaling 565.15 KB in 0.117 seconds (4.72 MB/sec)
Jan 25 12:20:45 jenkins.hb-studios.com unity-cache-server[672]: [Cluster:3] Sent 137600 of 137600 requested files (371370 chunks) totaling 16.67 GB in 236.885 seconds (72.05 MB/sec)
Jan 25 12:32:33 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:1] Sent 90 of 90 requested files (92 chunks) totaling 400.4 KB in 0.051 seconds (7.67 MB/sec)
Jan 25 12:34:27 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:4] Sent 264 of 264 requested files (264 chunks) totaling 704.54 KB in 0.163 seconds (4.22 MB/sec)
Jan 25 12:35:18 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:2] Sent 4 of 4 requested files (4 chunks) totaling 18.21 KB in 0.002 seconds (8.89 MB/sec)
Jan 25 12:36:50 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:3] Sent 14 of 14 requested files (21 chunks) totaling 563.77 KB in 0.017 seconds (32.39 MB/sec)
Jan 25 12:38:18 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:2] Sent 242 of 242 requested files (242 chunks) totaling 579.49 KB in 0.141 seconds (4.01 MB/sec)
Jan 25 12:43:34 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:1] Sent 26 of 26 requested files (26 chunks) totaling 153.54 KB in 0.021 seconds (7.14 MB/sec)
Jan 25 12:45:55 jenkins.hb-studios.com unity-cache-server[19109]: [Cluster:2] Sent 26 of 26 requested files (26 chunks) totaling 153.54 KB in 0.004 seconds (37.49 MB/sec)

Perhaps those large requests (130k) are the problem? How does that compare to your test data?

Not that big; I'm testing with only about 1500 files. We've got a very large customer project that I will try to use for testing instead, which should be comparable to your case.

Even so, the allocations in question should all be garbage collected eventually.

Ultimately, if I can't get it under control, I might just back out the change that added soft locking for files that are open for read. The purpose was to prevent a client from overwriting a file that was open by one or more other clients, because Node's streaming APIs do not lock files at the system level.

However, in practice, the chance of an overwrite is low.
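
For context, the general shape of that soft-locking idea (my own simplified sketch, not the project's implementation) is a per-file reader count that writers consult before overwriting:

// Simplified illustration of per-file soft locking (not the project's code).
const readers = new Map(); // file path -> number of open read streams

function acquireRead(filePath) {
  readers.set(filePath, (readers.get(filePath) || 0) + 1);
}

function releaseRead(filePath) {
  const count = (readers.get(filePath) || 1) - 1;
  if (count <= 0) readers.delete(filePath);
  else readers.set(filePath, count);
}

function canOverwrite(filePath) {
  // A PUT would be deferred or rejected while any client still has the file open.
  return !readers.has(filePath);
}

The complication is that with cluster workers this bookkeeping has to be shared across processes, which is where the cross-process messaging overhead comes in.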

Which change is that exactly? I can easily test with it reverted if you like.

If writes are done to temporary files and then atomically renamed, concurrency shouldn't be a problem, right?
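
For reference, the write-to-temp-then-rename pattern being suggested looks roughly like this (a generic sketch, assuming the temp file and destination live on the same filesystem so rename() is atomic; the helper name is illustrative):

const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

// Generic sketch: write to a hidden temp file in the same directory, then rename
// into place. Readers see either the old file or the complete new one, and an
// already-open read stream keeps reading the old inode after the rename.
function writeFileAtomic(finalPath, data, callback) {
  const tmpPath = path.join(
    path.dirname(finalPath),
    `.${path.basename(finalPath)}.${crypto.randomBytes(6).toString('hex')}.tmp`);
  fs.writeFile(tmpPath, data, (err) => {
    if (err) return callback(err);
    fs.rename(tmpPath, finalPath, callback);
  });
}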

I just had one of the workers jump up to 10g:

19127 unityca+ 20 0 13.294g 9.780g 585gg2 S 0.0 62.8 11:24.75 node

Is be9541e the change you were considering reverting?

@corngood yes that is the change, but I don't want to just revert it directly.

I've published 6.2.5-beta1, which drops the ClusterMessages class that is the source of all of the memory overhead. Instead it reimplements reference counting directly with the cluster module.
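
Roughly, reference counting over the cluster module's built-in IPC channel could look like the sketch below (my own illustration of the pattern, not the actual beta code; the message shape and names are made up): workers report open/close events to the master, which keeps a single count per file.

const cluster = require('cluster');

if (cluster.isMaster) {
  // Master keeps one reader count per file path across all workers.
  const refs = new Map();
  cluster.on('message', (worker, msg) => {
    if (!msg || msg.type !== 'readRef') return;
    const count = (refs.get(msg.path) || 0) + msg.delta;
    if (count <= 0) refs.delete(msg.path);
    else refs.set(msg.path, count);
  });
} else {
  // Worker side: notify the master when a read stream opens (+1) or closes (-1).
  const sendReadRef = (filePath, delta) =>
    process.send({ type: 'readRef', path: filePath, delta });
  // e.g. sendReadRef(filePath, +1) on 'open', sendReadRef(filePath, -1) on 'close'.
}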

@stephen-palmer I don't see it in the repo yet, but once it's up I'll give it a try. Thanks.

It should be there if you install with the beta tag: npm install -g unity-cache-server@beta

I've been running from the repository, so it would be easier for me to test changes if you push them here, to a test branch, or to your personal fork. Could you do that? Ideally all the npm releases would be tagged on here anyway.

pushed release/v6.2.5

Thanks. I'll run it for a few days and let you know how it goes.

It didn't go well:

[image: memory usage graph]

I had to kill it because the server was unresponsive. One worker had 8+GiB resident.

sigh ok, thanks. The more I look at this the more I realize this whole approach to locking files is wrong, especially in the context of cluster processes. Besides that, it's an edge case that can also be mitigated by using high reliability mode (which will disallow overwriting an existing version altogether). I'm just going to rip it out.
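
For reference, based on the keys visible in the --dump-config output above, enabling high reliability mode for cache_fs in config.yml would presumably look something like this (my assumption from the dumped defaults, not verified documentation):

Cache:
  options:
    cache_fs:
      highReliability: true
      highReliabilityOptions:
        reliabilityThreshold: 2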

release/v6.2.5 branch pushed with ref counting removed completely for CacheFS. I would appreciate one last test to confirm memory usage is back to 6.1 levels. If not, I've been chasing rabbits.

I've cleaned up the branch, and prepared a final v6.2.5 release with this and a few more minor fixes. I'll sit on it for a day or two to give you guys time to test if you would like to. I appreciate the help so far!

Thanks. I've just started it up, and I'll let you know how it goes tomorrow.

I ran 6.1 all day yesterday and the highest I saw a worker get was 1.8g, which is a lot obviously, but that seems to be the maximum, and it's stable at that size. Today on 6.2.5 one of the workers was at 6g after a couple of hours. I'm still not sure what triggers the memory increase exactly. I did a full import myself after starting it, and it only went up to 1.2g or so.

I briefly tried to bisect the two, but most of the in-between revs had problems that caused them not to start. I had to stop doing that on our production server and go back to 6.1 for now.

Thanks David. That's very strange. Maybe something during PUT transactions since your full GET didn't cause a spike.

I'm closing this with the changes released in v6.2.5, which remove the system that was causing the known memory spikes. Let's open a new ticket when appropriate to dig into your issue @corngood

Sounds good. Sorry I haven't had any more info recently. 6.1 has been stable aside from the CPU usage problem from #68. I'll open a new issue when I get a chance to do more investigation.

Note: for projects with a lot of cached files, it can be helpful to run the cache server using these options:

/usr/bin/nodejs --max-old-space-size=4096 /usr/lib/node_modules/unity-cache-server/main.js

This should not be necessary after the removal of the file locking mechanism in 6.2.5+; i.e., memory overhead should no longer scale with cache size, only with concurrent connections, and in that case only for the duration of a client session. Persistent memory growth at this point would point to a leak, so please open an issue accordingly.

done #114 :)