tech-greedy/singularity

Heap overflow

RobQuistNL opened this issue · 14 comments

Describe the bug
Heap overflow with big dataset

Version
1.1.1

To Reproduce
Run singularity on a big dataset

Expected behavior
It doesn't overflow.

2022-07-23T09:03:32.689Z [deal_preparation_worker] info: Started scanning. - {"id":"62da90af857640f0422868d4","name":"ngi-igenomes","path":"/storage0/slingshotv3/mnt/ngi-igenomes","minSize":18897856102,"maxSize":32641751449}

<--- Last few GCs --->

[2241119:0x4ac36f0] 59832069 ms: Mark-sweep 4071.5 (4138.8) -> 4071.4 (4138.8) MB, 3065.4 / 0.0 ms  (average mu = 0.961, current mu = 0.399) allocation failure scavenge might not succeed
[2241119:0x4ac36f0] 59835952 ms: Mark-sweep 4079.3 (4138.8) -> 4079.3 (4170.8) MB, 3879.8 / 0.0 ms  (average mu = 0.914, current mu = 0.001) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb0a860 node::Abort() [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 5: 0xeb1685  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 6: 0xec134d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 7: 0xec404e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 8: 0xe8558a v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x11fe2d6 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
10: 0x15f2d39  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
2022-07-23T18:52:37.651Z [deal_preparation_worker] info: 96fef747-8e99-4f82-860c-8f116f1d5f1b - Polled a new scanning request. - {"name":"ngi-igenomes","id":"62da90af857640f0422868d4"}
2022-07-23T18:52:37.652Z [deal_preparation_worker] info: Started scanning. - {"id":"62da90af857640f0422868d4","name":"ngi-igenomes","path":"/storage0/slingshotv3/mnt/ngi-igenomes","minSize":18897856102,"maxSize":32641751449}

<--- Last few GCs --->

[2241139:0x4ac36f0] 95197635 ms: Mark-sweep 4071.4 (4138.3) -> 4071.3 (4138.3) MB, 2863.7 / 0.0 ms  (average mu = 0.967, current mu = 0.414) allocation failure scavenge might not succeed
[2241139:0x4ac36f0] 95201130 ms: Mark-sweep 4079.1 (4138.3) -> 4079.1 (4170.3) MB, 3491.8 / 0.0 ms  (average mu = 0.926, current mu = 0.001) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb0a860 node::Abort() [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 5: 0xeb1685  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 6: 0xec134d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 7: 0xec404e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 8: 0xe8558a v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x11fe2d6 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
10: 0x15f2d39  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
2022-07-24T04:56:36.421Z [deal_preparation_worker] info: 728534b1-3b2d-47f7-b75c-fbdc2f4835c0 - Polled a new scanning request. - {"name":"ngi-igenomes","id":"62da90af857640f0422868d4"}
2022-07-24T04:56:36.421Z [deal_preparation_worker] info: Started scanning. - {"id":"62da90af857640f0422868d4","name":"ngi-igenomes","path":"/storage0/slingshotv3/mnt/ngi-igenomes","minSize":18897856102,"maxSize":32641751449}

<--- Last few GCs --->

[2241048:0x4ac36f0] 131427718 ms: Scavenge 4054.4 (4117.0) -> 4054.2 (4128.0) MB, 407.8 / 0.0 ms  (average mu = 0.989, current mu = 0.887) allocation failure 
[2241048:0x4ac36f0] 131428129 ms: Scavenge 4061.2 (4128.0) -> 4061.9 (4128.8) MB, 407.3 / 0.0 ms  (average mu = 0.989, current mu = 0.887) allocation failure 
[2241048:0x4ac36f0] 131429054 ms: Scavenge 4062.0 (4128.8) -> 4061.1 (4152.0) MB, 925.1 / 0.0 ms  (average mu = 0.989, current mu = 0.887) allocation failure 


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb0a860 node::Abort() [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 5: 0xeb1685  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 6: 0xec134d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 7: 0xec404e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 8: 0xe8558a v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x11fe2d6 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
10: 0x15f2d39  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
2022-07-24T14:36:34.661Z [deal_preparation_worker] info: 557c12ec-25e5-44f0-b9e6-885015c89d30 - Polled a new scanning request. - {"name":"ngi-igenomes","id":"62da90af857640f0422868d4"}
2022-07-24T14:36:34.661Z [deal_preparation_worker] info: Started scanning. - {"id":"62da90af857640f0422868d4","name":"ngi-igenomes","path":"/storage0/slingshotv3/mnt/ngi-igenomes","minSize":18897856102,"maxSize":32641751449}

<--- Last few GCs --->

[2241076:0x4ac36f0] 166239514 ms: Scavenge 4060.4 (4123.2) -> 4060.2 (4134.0) MB, 372.5 / 0.0 ms  (average mu = 0.984, current mu = 0.503) allocation failure 
[2241076:0x4ac36f0] 166239889 ms: Scavenge 4067.2 (4134.0) -> 4067.9 (4134.7) MB, 372.5 / 0.0 ms  (average mu = 0.984, current mu = 0.503) allocation failure 
[2241076:0x4ac36f0] 166240291 ms: Scavenge 4068.0 (4134.7) -> 4067.1 (4158.0) MB, 401.0 / 0.0 ms  (average mu = 0.984, current mu = 0.503) allocation failure 


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb0a860 node::Abort() [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 5: 0xeb1685  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 6: 0xec134d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 7: 0xec404e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 8: 0xe8558a v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x11fe2d6 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
10: 0x15f2d39  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
2022-07-25T01:07:33.389Z [deal_preparation_worker] info: d463cd8b-5854-440c-b33b-243b1e6c1a4b - Polled a new scanning request. - {"name":"ngi-igenomes","id":"62da90af857640f0422868d4"}
2022-07-25T01:07:33.389Z [deal_preparation_worker] info: Started scanning. - {"id":"62da90af857640f0422868d4","name":"ngi-igenomes","path":"/storage0/slingshotv3/mnt/ngi-igenomes","minSize":18897856102,"maxSize":32641751449}

<--- Last few GCs --->

[2241153:0x4ac36f0] 204089829 ms: Scavenge 4061.9 (4124.8) -> 4061.7 (4135.8) MB, 410.6 / 0.0 ms  (average mu = 0.982, current mu = 0.499) allocation failure 
[2241153:0x4ac36f0] 204090241 ms: Scavenge 4068.7 (4135.8) -> 4069.4 (4136.5) MB, 409.5 / 0.0 ms  (average mu = 0.982, current mu = 0.499) allocation failure 
[2241153:0x4ac36f0] 204093611 ms: Mark-sweep 4069.5 (4136.5) -> 4068.6 (4159.5) MB, 3369.1 / 0.0 ms  (average mu = 0.962, current mu = 0.414) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb0a860 node::Abort() [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 5: 0xeb1685  [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 6: 0xec134d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 7: 0xec404e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 8: 0xe8558a v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
 9: 0x11fe2d6 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/master/.nvm/versions/node/v16.16.0/bin/node]
10: 0x15f2d39  [/home/master/.nvm/versions/node/v16.16.0/bin/node]

@RobQuistNL The fix is documented here
3f7c84f
However, this dataset is really testing the edge cases so you may encounter other issues.
Please unclaim this dataset and we will test it out thoroughly.

Please also test with this dataset: gnss-ro-data.

2022-07-25T14:00:19.734Z [deal_preparation_worker] info: 790e598e-f615-47c9-b990-5b01f806828a - Polled a new generation request. - {"id":"62dea1bac177ad178465b9ce","datasetName":"gnss-ro-data","index":0}
2022-07-25T14:00:19.786Z [deal_preparation_worker] error: 790e598e-f615-47c9-b990-5b01f806828a - {"error":{"ok":0,"code":292,"codeName":"QueryExceededMemoryLimitNoDiskUseAllowed"}}

I have about 2 TiB of RAM available, so I can see how far we can push it. But there's little to no information on indexing progress, which is something I'd like to see.

The tooling I made clearly showed the indexing progress, which mattered because another dataset I indexed had around 160 million files. It took about 6 days to index completely, using multiple parallel AWS CLI list commands.

Indexing information has been added and will be in the next release.

I'm curious, is there a reason that increasing max heap memory (--max_old_space_size) doesn't work @RobQuistNL?

Never said it didn't, but that's a workaround :) I haven't tried it yet.

@zteeed it needs to be set as an env variable

@RobQuistNL heheh give it a try! I run into heap issues often when processing massive JSON files and increasing the heap limit always works. Especially considering that you have an incredible 2 TiB's worth of RAM, I think you shouldn't have a problem if you just jack up the heap limit. Give us a shout here if it works out, I'd love to know!
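
For anyone trying this, it may help to confirm which heap limit the Node.js process actually ended up with. The GC traces above show the heap failing at roughly 4,070 to 4,080 MB, which suggests the worker was already running with a limit of about 4 GB, so a setting of 4096 would probably change little. The snippet below is a minimal sketch, not part of singularity: `requestedOldSpaceMiB` and `heapLimitMiB` are hypothetical helper names, and the only real API used is `v8.getHeapStatistics()`.

```javascript
// Sketch: compare the heap size requested via NODE_OPTIONS with the limit V8 reports.
const v8 = require("v8");

// heap_size_limit is reported in bytes; convert to MiB for readability.
function heapLimitMiB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// Hypothetical helper: pull --max-old-space-size (dash or underscore spelling)
// out of a NODE_OPTIONS-style string. Returns null when the flag is absent.
function requestedOldSpaceMiB(nodeOptions) {
  const match = /--max[-_]old[-_]space[-_]size=(\d+)/.exec(nodeOptions || "");
  return match ? Number(match[1]) : null;
}

console.log("NODE_OPTIONS requests:", requestedOldSpaceMiB(process.env.NODE_OPTIONS), "MiB");
console.log("effective heap limit:", heapLimitMiB(), "MiB");
```

Run it in the same shell that starts the daemon, after the `export NODE_OPTIONS=...` line, so it sees the same environment. The reported limit covers the whole heap, so it may not match the old-space setting exactly.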

Thanks, I will try this way

Hi again @timelytree, we get the same error even with export NODE_OPTIONS="--max-old-space-size=4096"
Here are the steps I followed:

  • Clean all
  • export NODE_OPTIONS="--max-old-space-size=4096"
  • singularity init
  • singularity daemon
  • singularity prep create...
slingshot@hoyt:/home/slingshot/dataset1$ 2022-07-27T14:01:49.827Z [deal_tracking_service] info: Start update deal tracking - {}
2022-07-27T14:11:49.836Z [deal_tracking_service] info: Start update deal tracking - {}
2022-07-27T14:20:19.309Z [deal_preparation_worker] info: Created a new generation request. - {"id":"62e13e8acac877b7bd59bf63","name":"gnss-ro-data","index":0}
2022-07-27T14:21:12.013Z [deal_preparation_worker] info: Marking generation request to active - {"id":"62e13e8acac877b7bd59bf63","name":"gnss-ro-data","index":0}
2022-07-27T14:21:12.786Z [deal_preparation_worker] info: 733f21cf-a36b-4b53-834c-35e124c756bf - Polled a new generation request. - {"id":"62e149a377bbce5c9c90d5ab","datasetName":"gnss-ro-data","index":0}
2022-07-27T14:21:12.829Z [deal_preparation_worker] error: 733f21cf-a36b-4b53-834c-35e124c756bf - {"error":{"ok":0,"code":292,"codeName":"QueryExceededMemoryLimitNoDiskUseAllowed"}}

@zteeed that's a different error and should be fixed in the next release. Could you try this branch?
dev/native-s3-support

git clone https://github.com/tech-greedy/singularity.git
cd singularity
git checkout dev/native-s3-support
npm ci
npm run build
npm link
singularity daemon

I already tried it and it worked for your dataset.

What I mean is that this listing should be batched, not one huge recursive glob call. That might be a little difficult with actual file / directory listings, but there should be tools to work around that.
If the actual AWS S3 API were used, you would even get pagination, and you could multithread it and keep track of progress more easily. It looks like the indexing is a single long-running function right now. That's just not the way to do it with bigger datasets.
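
The batching idea above could look roughly like the sketch below. It is an illustration under stated assumptions, not singularity's code: `makeFakeLister`, `listInBatches` and `indexAll` are hypothetical names, the lister is an in-memory stand-in for a paginated API that returns a continuation token, and it is synchronous only for brevity (a real S3 client would be asynchronous).

```javascript
// Hypothetical in-memory stand-in for a paginated listing API.
// It returns one page of items plus a token for the next page, if any.
function makeFakeLister(keys, pageSize) {
  return function fetchPage(token) {
    const start = token === undefined ? 0 : token;
    const end = Math.min(start + pageSize, keys.length);
    return {
      items: keys.slice(start, end),
      nextToken: end < keys.length ? end : undefined,
    };
  };
}

// Yield one page at a time so only a single batch is held in memory.
function* listInBatches(fetchPage) {
  let token;
  do {
    const page = fetchPage(token);
    yield page.items;
    token = page.nextToken;
  } while (token !== undefined);
}

// Consume the batches and report progress after each one.
function indexAll(fetchPage, onProgress) {
  let indexed = 0;
  for (const batch of listInBatches(fetchPage)) {
    indexed += batch.length; // a real indexer would persist the batch here
    onProgress(indexed);
  }
  return indexed;
}

const keys = Array.from({ length: 10 }, (_, i) => `file-${i}`);
const total = indexAll(makeFakeLister(keys, 4), (n) => console.log(`indexed ${n} files so far`));
console.log(`done: ${total} files`);
```

Because each page is handled and then dropped, memory use depends on the page size rather than on the total number of files, and the running count gives a progress figure for free.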

@RobQuistNL your concern should be addressed by the dev branch above. The issue is not with the glob function, it's with FUSE. The glob function is an async iterator, but FUSE has to collect all objects in the directory before returning.

It's much better @liuziba, thanks!

Solved by #108