Many readlink syscalls
the8472 opened this issue · 5 comments
The "Gathering file list..." phase seems to take a lot of time and peeking at it with strace
indicates that it's doing a ton of readlink
syscalls for each file. I'm not sure what the purpose is but that can probably be optimized?
duperemove 0.13
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarry_u64.html", {st_mode=S_IFREG|0644, st_size=4809, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarryx_u64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarryx_u64.html", {st_mode=S_IFREG|0644, st_size=4874, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm512_madd52hi_epu64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm512_madd52hi_epu64.html", {st_mode=S_IFREG|0644, st_size=5791, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm_madd52hi_epu64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm_madd52hi_epu64.html", {st_mode=S_IFREG|0644, st_size=5729, ...}) = 0
I think this is caused by this code since realpath uses readlink internally:
Lines 315 to 329 in 38675a9
I don't quite understand why the sanitizing is necessary on every file.
Instead only canonicalizing the starting points should do the job. Once that's an absolute, symlink-free path adding the descendants will remain so since directory traversal isn't following symlinks either.
I think this is caused by this code since realpath uses readlink internally:
Lines 315 to 329 in 38675a9
I don't quite understand why the sanitizing is necessary on every file. Instead only canonicalizing the starting points should do the job. Once that's an absolute, symlink-free path adding the descendants will remain so since directory traversal isn't following symlinks either.
You are correct: call from walk_dir()
should not use realpath(3)
I am currently reworking the whole scan phase to fix many of its issues:
0.27 [jack:~/git/duperemove] git diff --stat origin/master | tail -1
21 files changed, 1103 insertions(+), 1285 deletions(-)
I will work on this issue after that WIP is merged
Cool, will the refactoring also reduce the memory usage (previous versions didn't use as much) or should I file a separate issue for that?
Cool, will the refactoring also reduce the memory usage (previous versions didn't use as much) or should I file a separate issue for that?
Yes it does
Last time I checked, on some extreme cases (for instance, the "large number of identical small files"), I had up to 90% memory reduction for the scan phase
Hello @the8472
I believe your issue has been fixed in the latest release
Feel free to reopen if you still face the issue
Thank you for your contribution !