High-level squashfuse optimization opportunities
haampie opened this issue · 5 comments
I've documented some performance issues with `squashfuse` here: https://github.com/haampie/squashfs-mount/. I'm observing about a 1.5x increase in LLVM compile time when mounting compilers from a squashfs file using `squashfuse` compared to just using (lib)mount.
Is this overhead expected?
You would definitely expect overhead from FUSE over an in-kernel driver, yes.
Okay, then I'll stick to the kernel version for now. Thanks!
FWIW: when using `squashfuse_ll` instead of `squashfuse`, I get a 10x speedup for `du -sh mountpoint/`. Perf tells me `squashfuse` spends the vast majority of its time decompressing, whereas `squashfuse_ll` spends only about 5% of its time there.
Is there any reason to keep the high-level version if the low-level version performs so much better?
The high-level version uses a simpler FUSE API, which has wider availability: a number of platforms (Minix, NetBSD) only support the high-level API. Unfortunately, the high-level API doesn't map one-to-one(-ish) onto kernel VFS operations; instead it talks to a library layer that manages things like inode allocation, which makes it inherently slower. Something that hits many different inodes, like `du`, should be particularly bad.
If squashfuse_ll works better for you, then I recommend sticking with it! But I'd like to keep squashfuse available on other platforms, and there's no harm in leaving the high-level version around.
Let me rename this ticket to something about optimizing high-level squashfuse, since that seems to be where this has landed. Please go ahead and explain how you did your testing, and share what results you got. Then we can use this as an opportunity for anybody who wants to spend time optimizing high-level squashfuse.
Using `squashfuse` the timing is consistently:
```
$ time du -sh /x
43G /x
real 0m12.548s
user 0m0.040s
sys 0m0.592s
$ time du -sh /x
43G /x
real 0m12.450s
user 0m0.024s
sys 0m0.569s
$ time du -sh /x
43G /x
real 0m12.397s
user 0m0.059s
sys 0m0.526s
```
so there are no caching effects.
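To double-check the no-caching claim, the page cache can be dropped explicitly between runs. This is a hedged sketch: `drop_caches` is Linux-specific and needs root, so the snippet skips it otherwise.

```shell
# Flush dirty pages, then drop clean page cache, dentries and inodes.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
    echo "dropped page cache"
else
    echo "not root: skipping drop_caches"
fi
```

If the timings stay the same with and without this step, the runs really are uncached.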
`squashfuse_ll` is 13x faster on the first run and 45x faster on the second and later runs:
```
$ squashfuse_ll file.squashfs /x
$ time du -sh /x
42G /x
real 0m0.902s
user 0m0.040s
sys 0m0.405s
$ time du -sh /x
42G /x
real 0m0.275s
user 0m0.018s
sys 0m0.167s
$ time du -sh /x
42G /x
real 0m0.269s
user 0m0.005s
sys 0m0.176s
```
Plain `mount` is best:
```
$ time du -sh /x
42G /x
real 0m0.527s
user 0m0.020s
sys 0m0.497s
$ time du -sh /x
42G /x
real 0m0.108s
user 0m0.032s
sys 0m0.075s
$ time du -sh /x
42G /x
real 0m0.109s
user 0m0.028s
sys 0m0.080s
```
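The timed runs above can be reproduced with a small harness like the following sketch. `time_du` is a name I made up, and `/x` is the mount point used in this thread; substitute your own.

```python
"""Time `du -sh` on a directory several times in a row."""
import subprocess
import time


def time_du(path: str, runs: int = 3) -> list[float]:
    """Run `du -sh path` repeatedly, return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["du", "-sh", path], check=True,
                       stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # e.g. time_du("/x") for a squashfuse mount point
    for i, t in enumerate(time_du("."), 1):
        print(f"run {i}: {t:.3f}s")
```

Comparing the first run against later ones separates cold-cache from warm-cache behaviour, which is where the 13x vs 45x numbers above come from.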
Perf shows `squashfuse` spends the vast majority of its time decompressing:
```
# Children Self Command Shared Object Symbol
# ........ ........ .......... .................. .........................................................
#
44.07% 44.06% squashfuse libzstd.so.1.5.2 [.] ZSTD_decompressBlock_internal.part.13
|
--44.04%--ZSTD_decompressBlock_internal.part.13
16.41% 16.41% squashfuse libzstd.so.1.5.2 [.] _HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop
|
--16.40%--_HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop
```