Oxen-AI/Oxen

no way to reset an ongoing commit (related to the "too many open files" error)

alefminus opened this issue · 4 comments

Once there are a few "add"-ed files, there is no way to reset back to the "no files added" state.

Additionally, it seems that the open-file limit effectively caps the number of files that can be added (or updated?) in a single commit, since a commit also requires opening all of them simultaneously.

mkdir commit-opens-all-files-simultaneously
cd commit-opens-all-files-simultaneously
oxen init
for ((i=0;i<10;++i)); do touch $i.txt; done
oxen add *.txt
strace --decode-fds=path -f -e trace=file -o oxen_commit.trace oxen commit -m test-oxen-commit-open
cat oxen_commit.trace | grep -E '= ([0-9][0-9])<.*/.\.txt>'

Output:

465418 <... openat resumed>)            = 44</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/9.txt>
465417 <... openat resumed>)            = 43</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/0.txt>
465421 <... openat resumed>)            = 45</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/3.txt>
465420 <... openat resumed>)            = 49</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/6.txt>
465419 <... openat resumed>)            = 48</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/8.txt>
465421 openat(AT_FDCWD</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously>, "/home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/2.txt", O_RDONLY|O_CLOEXEC) = 43</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/2.txt>
465418 openat(AT_FDCWD</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously>, "/home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/1.txt", O_RDONLY|O_CLOEXEC) = 45</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/1.txt>
465419 openat(AT_FDCWD</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously>, "/home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/7.txt", O_RDONLY|O_CLOEXEC) = 44</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/7.txt>
465420 openat(AT_FDCWD</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously>, "/home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/4.txt", O_RDONLY|O_CLOEXEC) = 46</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/4.txt>
465417 <... openat resumed>)            = 45</home/alon/src/Oxen/bugs/commit-opens-all-files-simultaneously/5.txt>

It seems the files are mostly opened at the same time (some fds are reused).

I tried following the code to see where the files are actually opened. I see that the hashing opens the file, and it should be closed since the handle is not returned (only the hash result is). I could use some pointers if you think this could be an easy PR.
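For illustration, here is a minimal std-only sketch of the drop-closes-the-fd behavior I mean (not Oxen's actual hashing code):

```rust
// Sketch: the fd opened by File::open is released as soon as `file`
// is dropped at the end of the function, since only the hash escapes.
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::Read;

fn hash_of(path: &str) -> std::io::Result<u64> {
    let mut file = File::open(path)?; // fd acquired here
    let mut contents = Vec::new();
    file.read_to_end(&mut contents)?;
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    Ok(hasher.finish()) // `file` dropped here => fd closed
}
```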

Appreciate the testing you've been doing!

  1. Once there are a few "add"-ed files, there is no way to reset back to the "no files added" state.

Did you try oxen restore --staged <file>?

  2. Can you elaborate on the second point? It does open the files in parallel during commit in order to hash them; is this related to the ulimit test you did?

Ok - that explains the number of opened files.

Can you elaborate

Yes: the ulimit change I made was just to reproduce the issue more easily. The default (on my system) is 1024. ulimit lets you change per-process limits on various knobs, in this case the maximum number of open files a process can have, i.e. the open(n+1) system call will fail with an error (EMFILE, "Too many open files").
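As a quick, hypothetical demo of hitting that limit (assuming the libc crate; the path and the limit of 16 are arbitrary):

```rust
// Demo only: lower the per-process open-file limit, then open files
// until open(2) fails with EMFILE ("Too many open files").
use std::fs::File;

fn main() {
    unsafe {
        // Lower both the soft and hard limits to 16 descriptors.
        let lim = libc::rlimit { rlim_cur: 16, rlim_max: 16 };
        libc::setrlimit(libc::RLIMIT_NOFILE, &lim);
    }
    let mut held = Vec::new();
    loop {
        match File::open("/etc/hostname") {
            Ok(f) => held.push(f), // keep handles open on purpose
            Err(e) => {
                // The error here is EMFILE once the limit is exhausted.
                eprintln!("failed after {} opens: {}", held.len(), e);
                break;
            }
        }
    }
}
```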

The parallel hashing causes contention: it is possible that all of the hashing happens in parallel, which will use the maximum number of file descriptors (equal to the number of files being committed).

I suggest a pool of hashing threads (or async tasks; I think you are using tokio?) to limit the number of concurrent open files; that could cap fd usage without large code/architecture churn. Actually, I think this kind of maximum-thread-count is already implemented in thread pools; with something like rayon it should be easy (I have not tried it myself; see the sketch below). Another option is to handle it locally, i.e. match on the error and retry.
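To make the suggestion concrete, here is a minimal sketch of bounding fd usage via a fixed-size rayon pool (assuming the rayon and sha2 crates; this is not Oxen's actual code, and the names and pool size are illustrative):

```rust
// Sketch only: cap concurrent opens by bounding the hashing pool size.
use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::{fs::File, io::Read, path::PathBuf};

fn hash_file(path: &PathBuf) -> std::io::Result<Vec<u8>> {
    // One fd per worker thread; it is closed when `file` is dropped.
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().to_vec())
}

fn hash_all(paths: &[PathBuf]) -> Vec<std::io::Result<Vec<u8>>> {
    // At most 8 files are open at once, no matter how many are committed.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build()
        .unwrap();
    pool.install(|| paths.par_iter().map(hash_file).collect())
}
```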

The async task is already using rayon and defaults to the number of CPUs on the system. It turned out that rocksdb was opening many files as well, so I added a MAX_OPEN_FILES limit default and an env variable.
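A hedged sketch of what that cap can look like with the rust rocksdb crate (the MAX_OPEN_FILES env variable name is from the comment above; the default of 128 here is illustrative, not necessarily what Oxen uses):

```rust
// Sketch (assuming the `rocksdb` crate): bound RocksDB's fd usage.
use rocksdb::{Options, DB};

fn open_db(path: &str) -> Result<DB, rocksdb::Error> {
    // Illustrative: read the cap from MAX_OPEN_FILES, falling back to 128.
    let max_open: i32 = std::env::var("MAX_OPEN_FILES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(128);
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // RocksDB defaults to -1 (unlimited) SST file handles; a positive
    // value bounds its contribution to the process-wide fd count.
    opts.set_max_open_files(max_open);
    DB::open(&opts, path)
}
```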