alexandres/terashuf

seeking clarification on max file descriptors

stas00 opened this issue · 2 comments

You write:

When shuffling very large files, terashuf needs to keep open SIZE_OF_FILE_TO_SHUFFLE / MEMORY temporary files. Make sure to set the maximum number of file descriptors to at least this number. By setting a large file descriptor limit, you ensure that terashuf won't abort a shuffle midway, saving precious researcher time.

  • what is SIZE_OF_FILE_TO_SHUFFLE? Is it the number of \n-terminated lines, the total size in bytes, the size in GBytes, or something else?
  • what is MEMORY measured in? GBs or bytes?

Perhaps an example would be self-documenting. If I have 70M records in a 600GB file and my MEMORY is set to the default 4, will I need:

  1. 1 file descriptor (70000000 / (4*2**30))
  2. 17.5M file descriptors (70000000 / 4)
  3. 150 file descriptors (600 / 4, if size is in GBs)

or something totally different?

Thank you!


edit: after running it I see it's (3) from my attempts above. Perhaps the suggestion could be modified to: SIZE_OF_FILE_TO_SHUFFLE_IN_GBS / MEMORY_IN_GBS
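To make that concrete, here is a back-of-the-envelope sketch in Python (the temp_files_needed helper is my own illustration, not part of terashuf):

```python
# Rough estimate of how many temporary files (and hence file descriptors)
# a shuffle will need, using the corrected formula above: file size in GB
# divided by the memory budget in GB.
import math

def temp_files_needed(file_size_gb: float, memory_gb: float = 4.0) -> int:
    return math.ceil(file_size_gb / memory_gb)

print(temp_files_needed(600))       # the 600GB example with the default MEMORY=4 -> 150
print(temp_files_needed(900, 150))  # the 900GB / 150GB run mentioned below -> 6
```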

Also, the default file descriptor limit is 16K, so realistically a user would need to shuffle a 64TB file with 4GB of memory before running into problems. Therefore perhaps a note to that effect could also be added, so that the average user doesn't worry unnecessarily.

So I ended up shuffling a 900GB file using 150GB of memory, and it needed just 6 additional file descriptors.
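In case it helps others, a hypothetical pre-flight check (plain Python, my own sketch, not something terashuf provides) could confirm the limit is high enough before kicking off a long run:

```python
# Compare the estimated temporary-file count against the per-process
# open-file limit before starting a multi-hour shuffle.
import math
import resource

def enough_descriptors(file_size_gb: float, memory_gb: float, headroom: int = 16) -> bool:
    needed = math.ceil(file_size_gb / memory_gb) + headroom  # temp files + stdin/stdout/etc.
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return needed <= soft_limit

print(enough_descriptors(900, 150))      # ~6 temp files, fine under a 1K or 16K limit
print(enough_descriptors(64 * 1024, 4))  # ~16K temp files, likely too many for a 16K limit
```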

Very awesome program - took only about 1h! Thank you!

Hi @stas00,

Thank you for raising this point. I've updated the README in 1a769a9 to clarify the issue. Please have a look and see if it's clearer now.

Also, the default file descriptor limit is 16K
Therefore perhaps a note to that effect could also be added, so that the average user doesn't worry unnecessarily.

On my machine the per-process limit was 1000, and I lost a large shuffle as a result! 😢 That's why I added the warning.

Your edit is great, @alexandres!

And you're right, on my home Ubuntu machine it's 1K as well! It was 16K on the HPC cluster. So your note is absolutely important!

Thank you!