markfasheh/duperemove

Many readlink syscalls

the8472 opened this issue · 5 comments

The "Gathering file list..." phase seems to take a lot of time and peeking at it with strace indicates that it's doing a ton of readlink syscalls for each file. I'm not sure what the purpose is but that can probably be optimized?

duperemove 0.13

stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarry_u64.html", {st_mode=S_IFREG|0644, st_size=4809, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarryx_u64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarryx_u64.html", {st_mode=S_IFREG|0644, st_size=4874, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm512_madd52hi_epu64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm512_madd52hi_epu64.html", {st_mode=S_IFREG|0644, st_size=5791, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm_madd52hi_epu64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm_madd52hi_epu64.html", {st_mode=S_IFREG|0644, st_size=5729, ...}) = 0

I think this is caused by this code since realpath uses readlink internally:

duperemove/file_scan.c

Lines 315 to 329 in 38675a9

/*
* Sanitize the file name and get absolute path. This avoids:
*
* - needless filerec writes to the db when we have
* effectively the same filename but the components have extra '/'
*
* - Absolute path allows the user to re-run this hash from
* any directory.
*/
if (realpath(path, abspath) == NULL) {
fprintf(stderr, "Error %d: %s while getting path to file %s. "
"Skipping.\n",
errno, strerror(errno), path);
goto out;
}

I don't quite understand why the sanitizing is necessary on every file.
Instead only canonicalizing the starting points should do the job. Once that's an absolute, symlink-free path adding the descendants will remain so since directory traversal isn't following symlinks either.

I think this is caused by this code since realpath uses readlink internally:

duperemove/file_scan.c

Lines 315 to 329 in 38675a9

/*
* Sanitize the file name and get absolute path. This avoids:
*
* - needless filerec writes to the db when we have
* effectively the same filename but the components have extra '/'
*
* - Absolute path allows the user to re-run this hash from
* any directory.
*/
if (realpath(path, abspath) == NULL) {
fprintf(stderr, "Error %d: %s while getting path to file %s. "
"Skipping.\n",
errno, strerror(errno), path);
goto out;
}

I don't quite understand why the sanitizing is necessary on every file. Instead only canonicalizing the starting points should do the job. Once that's an absolute, symlink-free path adding the descendants will remain so since directory traversal isn't following symlinks either.

You are correct: call from walk_dir() should not use realpath(3)

I am currently reworking the whole scan phase to fix many of its issues:

0.27 [jack:~/git/duperemove] git diff --stat origin/master | tail -1
 21 files changed, 1103 insertions(+), 1285 deletions(-)

I will work on this issue after that WIP is merged

Cool, will the refactoring also reduce the memory usage (previous versions didn't use as much) or should I file a separate issue for that?

Cool, will the refactoring also reduce the memory usage (previous versions didn't use as much) or should I file a separate issue for that?

Yes it does
Last time I checked, on some extreme cases (for instance, the "large number of identical small files"), I had up to 90% memory reduction for the scan phase

Hello @the8472
I believe your issue has been fixed in the latest release

Feel free to reopen if you still face the issue

Thank you for your contribution !