High-level squashfuse optimization opportunities
haampie opened this issue · 5 comments
I've documented some performance issues with `squashfuse` here: https://github.com/haampie/squashfs-mount/. I'm observing about a 1.5x increase in LLVM compile time when mounting compilers from a squashfs file using `squashfuse` compared to just using (lib)mount.
Is this overhead expected?
You would definitely expect overhead from FUSE over an in-kernel driver, yes.
Okay, then I'll stick to the kernel version for now. Thanks!
FWIW: when using `squashfuse_ll` instead of `squashfuse`, I get a 10x speedup for `du -sh mountpoint/`. Perf tells me `squashfuse` spends the vast majority of its time decompressing, whereas `squashfuse_ll` spends only about 5% of its time there.
Is there any reason to keep the high-level version if the low-level version performs so much better?
The high-level version uses a simpler FUSE API, which has wider availability: a number of platforms (Minix, NetBSD) only support the high-level API. Unfortunately, the high-level API doesn't map one-to-one(-ish) onto kernel VFS operations; instead it talks to a library layer that manages things like inode allocation, which makes it inherently slower. Something that hits many different inodes, like `du`, should be particularly bad.
If squashfuse_ll works better for you, then I recommend sticking with it! But I'd like to keep squashfuse available on other platforms, and there's no harm in leaving the high-level version around.
Let me rename this ticket to something about optimizing high-level squashfuse, since that seems to be where this has landed. Please go ahead and explain how you did your testing, and share what results you got. Then we can use this as an opportunity for anybody who wants to spend time optimizing high-level squashfuse.
Using `squashfuse` the timing is consistently:
```
$ time du -sh /x
43G /x
real 0m12.548s
user 0m0.040s
sys 0m0.592s
$ time du -sh /x
43G /x
real 0m12.450s
user 0m0.024s
sys 0m0.569s
$ time du -sh /x
43G /x
real 0m12.397s
user 0m0.059s
sys 0m0.526s
```
so there are no caching effects.
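To double-check the no-caching claim, the page cache can be dropped explicitly between runs. This is a hedged sketch: `drop_caches` is Linux-specific and needs root, so the snippet skips it otherwise.

```shell
# Flush dirty pages, then drop clean page cache, dentries and inodes.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
    echo "dropped page cache"
else
    echo "not root: skipping drop_caches"
fi
```

If the timings stay the same with and without this step, the runs really are uncached.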
`squashfuse_ll` is 13x faster on the first run and 45x faster on the second and later runs:
```
$ squashfuse_ll file.squashfs /x
$ time du -sh /x
42G /x
real 0m0.902s
user 0m0.040s
sys 0m0.405s
$ time du -sh /x
42G /x
real 0m0.275s
user 0m0.018s
sys 0m0.167s
$ time du -sh /x
42G /x
real 0m0.269s
user 0m0.005s
sys 0m0.176s
```
Plain `mount` is best:
```
$ time du -sh /x
42G /x
real 0m0.527s
user 0m0.020s
sys 0m0.497s
$ time du -sh /x
42G /x
real 0m0.108s
user 0m0.032s
sys 0m0.075s
$ time du -sh /x
42G /x
real 0m0.109s
user 0m0.028s
sys 0m0.080s
```
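The timed runs above can be reproduced with a small harness like the following sketch. `time_du` is a name I made up, and `/x` is the mount point used in this thread; substitute your own.

```python
"""Time `du -sh` on a directory several times in a row."""
import subprocess
import time


def time_du(path: str, runs: int = 3) -> list[float]:
    """Run `du -sh path` repeatedly, return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["du", "-sh", path], check=True,
                       stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # e.g. time_du("/x") for a squashfuse mount point
    for i, t in enumerate(time_du("."), 1):
        print(f"run {i}: {t:.3f}s")
```

Comparing the first run against later ones separates cold-cache from warm-cache behaviour, which is where the 13x vs 45x numbers above come from.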
Perf shows `squashfuse` spends the vast majority of its time decompressing:
```
# Children Self Command Shared Object Symbol
# ........ ........ .......... .................. .........................................................
#
44.07% 44.06% squashfuse libzstd.so.1.5.2 [.] ZSTD_decompressBlock_internal.part.13
|
--44.04%--ZSTD_decompressBlock_internal.part.13
16.41% 16.41% squashfuse libzstd.so.1.5.2 [.] _HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop
|
--16.40%--_HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop
```