overflow files: too many open files
lfoppiano opened this issue · 5 comments
I'm trying to re-train the GloVe embeddings on a large corpus.
The cooccur command generates a lot of new files (28745, to be precise), but it then crashes when recombining them, see below:
Writing cooccurrences to disk..............28745 files in total.
Merging cooccurrence files: processed 0 lines.Unable to open file overflow_1021.bin.
Errno: 24
Error description: Too many open files
I was wondering whether it would be possible to increase the size of these files (they seem to be 512 MB each) to, let's say, 1 or 2 GB?
It is not clear to me how the option -overflow-length is supposed to be used:
-overflow-length <int>
Limit to length <int> the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array, before writing to disk.
This value overrides that which is automatically produced by '-memory'. Typically only needs adjustment for use with very large corpora.
Alternatively, I guess I might need to increase the maximum number of open files.
Any suggestions?
It looks like, based on reading the code in cooccur.c, it is splitting the files up based on how much memory you have:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
while (fabs(rlimit - n * (log(n) + 0.1544313298)) > 1e-3) n = rlimit / (log(n) + 0.1544313298);
max_product = (long long) n;
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
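To make that arithmetic concrete, here is a quick back-of-the-envelope sketch (mine, not code from the repo) of how -memory maps to the default overflow file size. It assumes sizeof(CREC) is 16 bytes (two ints plus a double, padded), which is what I'd expect on a typical 64-bit build but may differ:

#include <stdio.h>

/* Rough sketch (not from cooccur.c) of how -memory determines the size of
 * each overflow file, assuming sizeof(CREC) == 16 bytes. */
int main(void) {
    double memory_gb = 4.0;     /* value passed to -memory; 4.0 is just an example */
    double crec_bytes = 16.0;   /* assumed sizeof(CREC) */

    double rlimit = 0.85 * memory_gb * 1073741824.0 / crec_bytes;
    double overflow_length = rlimit / 6.0;   /* records buffered per overflow file */
    double file_mb = overflow_length * crec_bytes / (1024.0 * 1024.0);

    printf("overflow_length = %.0f records, roughly %.0f MB per overflow file\n",
           overflow_length, file_mb);
    /* i.e. each overflow file is about 0.85 * memory_gb / 6 GB, so raising
     * -memory directly raises the file size and shrinks the file count */
    return 0;
}

With -memory 4.0 this comes out to roughly 0.57 GB per file, which is in the same ballpark as the ~512 MB files reported above, so bumping -memory is the direct way to get fewer, larger files.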
There is a flag -overflow-length to change the size of the files. However, presumably you'll run out of memory if you try to use that, unless it's off by a factor of 30. You could try adding more memory or using less data.
Another option would be to change merge_files to do two steps if there are more than 1000 (or ulimit) temp files, but I personally won't make that change unless this turns into a frequent issue, and I can guarantee no one else here will make such a change. If you make a PR with such a change, though, we'd be happy to integrate it.
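For what it's worth, the batching arithmetic behind such a two-stage merge is simple; this is a small sketch (mine, not code from cooccur.c) of how the 28745 files from this issue would be grouped under an assumed budget of 1000 open files:

#include <stdio.h>

/* Sketch of the plan a two-stage merge_files could follow: merge the temp
 * files in batches that fit under the open-file limit, producing intermediate
 * files, then merge the intermediates in a second pass. The file count is
 * taken from this issue; the fd budget is an assumption. */
int main(void) {
    long long num_files = 28745;   /* overflow files reported above */
    long long fd_budget = 1000;    /* descriptors allowed per merge (ulimit minus slack) */

    long long batch = fd_budget - 1;   /* one descriptor reserved for the output file */
    long long intermediates = (num_files + batch - 1) / batch;

    printf("pass 1: %lld merges of up to %lld files each -> %lld intermediate files\n",
           intermediates, batch, intermediates);
    printf("pass 2: one merge of %lld files (%s the limit)\n",
           intermediates, intermediates <= batch ? "under" : "still over");
    return 0;
}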
@AngledLuffa thanks!
I have changed -memory and increased it to 90.0 (as the machine has a lot of RAM), and the files are now 9.5 GB each.
To be on the safe side, I also increased the maximum number of open files.
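For the record, the limit behind "Errno: 24" is RLIMIT_NOFILE (what ulimit -n shows in the shell); here is a minimal POSIX sketch of checking and raising it from C, assuming the hard limit is already high enough:

#include <stdio.h>
#include <sys/resource.h>

/* Check the per-process open-file limit and raise the soft limit up to the
 * hard limit; this is the limit behind "Errno: 24 / Too many open files". */
int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;   /* raise the soft limit to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }
    return 0;
}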
Let's see in a couple of weeks whether it runs fine.
Thanks!