LycheeOrg/Lychee-Docker

Fails to start with large number of files stored on GlusterFS

gareththered opened this issue · 8 comments

I have Lychee running on a Docker Swarm with GlusterFS as persistent storage for config and images.

Since installing, I've added approx. 3000 images. Unfortunately, Lychee now fails to start. Monitoring the logs shows that it stalls on **** Set Permissions ****, and a couple of minutes later Docker times out and restarts the service. This continues ad infinitum.

This is down to GlusterFS's slow performance with many files: chowning and chmoding every image takes too long, even if their ownership and permissions are correct in the first place (approx. 40 seconds on GlusterFS, while a local ZFS directory takes less than a second).

I've cloned entrypoint.sh and removed /uploads from the arguments of those two commands, before bind mounting my copy over the original, and it now works.

My question therefore is - does it have to recursively set permissions on the /uploads directory each time it starts? Would it be better just to set the permission non-recursively on the top level directory if required, then (somehow) alert the user if the permissions are wrong further down?

d7415 commented

I'm reluctant to change the behaviour for an edge case. The permissions check is useful for a few scenarios, including tools messing with storage mounted from elsewhere or simple migration from a non-Docker instance.

Changing the scripts like you have is certainly an option, taking responsibility for the permissions there.

To confirm, was that 40 seconds per image?

That sounds like a reasonable enough reason to leave something in. How about checking the ownership and permissions first, instead of blindly overwriting them even if they're correct? Something similar to:

find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} \;

find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} \;

This is really fast if the ownership and/or permissions are correct. It seems that GlusterFS reads fast, but writes slowly. If all the images have the wrong ownership/permissions then it will begin to fix them. If, however, Docker believes the service hasn't started and restarts it, the next run will continue from where the previous one left off, until all the ownership/permissions are correct, at which point the service will start.
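As a rough illustration of why the check-first approach only writes where needed, here is a minimal sketch that runs the same permission predicate against a throwaway directory. The scratch directory and the file names good.jpg and bad.jpg are made up for the demo, and only the chmod half is exercised because chown normally needs root. It assumes GNU find, which evaluates the X in a symbolic mode separately for directories and regular files.

```shell
#!/bin/sh
# Sketch only: scratch directory and file names are hypothetical.
demo=$(mktemp -d)
touch "$demo/good.jpg" "$demo/bad.jpg"
chmod 664 "$demo/good.jpg"   # already has u+w, g+w and read for everyone
chmod 644 "$demo/bad.jpg"    # group write bit is missing

# Same predicate as the chmod command above, but printing instead of fixing.
needs_fix=$(find "$demo" -type f \( ! -perm -ug+w -o ! -perm -ugo+rX \) -print)
echo "would chmod: $needs_fix"

# Apply the fix, then confirm that a second pass finds nothing left to do.
find "$demo" \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX {} \;
remaining=$(find "$demo" \( ! -perm -ug+w -o ! -perm -ugo+rX \) -print | wc -l)
echo "entries still needing a fix: $remaining"

rm -rf "$demo"
```

Only bad.jpg is selected on the first pass, so on a filesystem where writes are the slow part, files that are already correct cost a read and nothing more.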

By the way, it was around 40 seconds to write ownership or permissions on all 3000 or so images - not one (thankfully!!)

d7415 commented

I like the find idea in principle. I'll try to have a look and hopefully incorporate it tomorrow.

> By the way, it was around 40 seconds to write ownership or permissions on all 3000 or so images - not one (thankfully!!)

Phew!

d7415 commented

Ok, so after some testing (with ~1000 small images on a local volume): when the permissions were already correct, find was quicker than chmoding/chowning without checking, but when they were wrong it was much slower - multiple seconds for each operation instead of hundredths of a second.

As a compromise, I tried running find over directories first (using chmod and chown recursively) and then checking for individual files afterwards. This gives a comparable worst case and, for best case (all permissions correct), was still fractionally quicker than the old code in my tests.

I ended up with:

find /sym /uploads -type d \( ! -user "$USER" -o ! -group "$USER" \) -exec chown -R "$USER":"$USER" \{\} \;
find /sym /uploads -type d \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod -R ug+w,ugo+rX \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} \;

@gareththered Before I commit them, do these work for you?
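To make the directory-first compromise a little more concrete, here is a minimal sketch on a throwaway tree. The scratch directory and the album name are made up for the demo, and only the chmod pair is exercised because chown normally needs root. It assumes GNU find and is not meant as a benchmark.

```shell
#!/bin/sh
# Sketch only: scratch tree with hypothetical names.
demo=$(mktemp -d)
mkdir "$demo/album"
i=1
while [ "$i" -le 20 ]; do
    touch "$demo/album/img_$i.jpg"
    i=$((i + 1))
done
# Make everything "wrong": no group write on directories or files.
chmod -R u=rwX,g=rX,o=rX "$demo"

# Pass 1: a directory with wrong permissions triggers one recursive chmod
# that covers its whole subtree. -print counts the directories acted on.
dirs_fixed=$(find "$demo" -type d \( ! -perm -ug+w -o ! -perm -ugo+rX \) \
    -exec chmod -R ug+w,ugo+rX {} \; -print | wc -l)

# Pass 2: the per-entry check now only has stragglers to deal with.
# Here there are none, because pass 1 already covered every file.
files_left=$(find "$demo" \( ! -perm -ug+w -o ! -perm -ugo+rX \) -print | wc -l)

echo "directories fixed recursively: $dirs_fixed"
echo "entries left for the per-file pass: $files_left"
rm -rf "$demo"
```

The second pass still matters when a directory itself looks correct but some files inside it do not, since the first pass only looks at directories.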

Fwiw, my original thinking with this issue was just a flag to turn this off, so thank you for this much better solution!

I've carried out some basic testing by docker execing into the running image.

FYI, I'm working with just over 9,000 files, not the 3,000 I initially said - I don't know where that figure came from! Of course, many of those 9,000 are the resized ones created on import by Lychee, so I have nowhere near that many images.

First, I set all ownership to be incorrect, which should take the same time as setting the correct ownership:

root@dc89912ecbd0:/# time chown -R 1001 uploads/

real	0m25.070s
user	0m0.085s
sys	0m2.021s

Then, set incorrect permissions. It seems to take a similar time to setting ownership.

root@dc89912ecbd0:/# time chmod -R g-w uploads/

real	0m24.225s
user	0m0.111s
sys	0m1.122s

Next, I time using the find commands you're proposing, which I've placed in a shell script so that I can time them:

echo "**** Set Permissions ****"
# Laravel needs to be able to chmod user.css for no good reason
find /sym /uploads -type d \( ! -user "$USER" -o ! -group "$USER" \) -exec chown -R "$USER":"$USER" \{\} \;
find /sym /uploads -type d \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod -R ug+w,ugo+rX \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} \;
chown www-data:"$USER" /conf/user.css
usermod -a -G "$USER" www-data
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} \;

Running the above for the first time results in the following, as it fixes all ownership and permissions:

root@dc89912ecbd0:/# time ./test.sh 
**** Set Permissions ****

real	0m51.583s
user	0m0.232s
sys	0m3.520s

The above shows that it takes around the same time to run as running chown and chmod recursively.

Finally, I run it again, but this time every ownership and permission should be good already, so it runs much quicker:

root@dc89912ecbd0:/# time ./test.sh
**** Set Permissions ****

real	0m2.538s
user	0m0.078s
sys	0m0.349s

So, it looks very good from my point of view.

Thinking about your test, where incorrect ownership/permissions were much slower to fix on a local volume, I tested the -exec {} + variant (where find builds the exec'd command with a long argument list) but got rather disappointing results:

root@dc89912ecbd0:/# time ./test.sh 
**** Set Permissions ****

real	1m39.615s
user	0m0.516s
sys	0m6.701s

Nearly double the time.

d7415 commented

> So, it looks very good from my point of view.

🎉

> Nearly double the time.

Ah, that's annoying - I had been meaning to test that but forgot, so thanks again! Strange though...

I did a quick test of your original proposed find but with +:

find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} +
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} +

and the results were very similar to the current leaders (no-find and -type d). But if that still breaks for your use, I guess the directory-first version is the best option.
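For readers unfamiliar with the two -exec terminators, the following minimal sketch counts how often find launches the command in each form. It uses a made-up scratch directory with 50 empty files and a harmless echo in place of chown/chmod. It only shows the difference in process count and does not explain the slower timing reported on GlusterFS above.

```shell
#!/bin/sh
# Sketch only: scratch directory and file names are hypothetical.
demo=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    touch "$demo/img_$i.jpg"
    i=$((i + 1))
done

# "\;" runs the command once per matching file.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l)
# "+" appends as many paths as fit onto a single command line.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "one-per-file form: $per_file invocations"
echo "batched form: $batched invocations"
rm -rf "$demo"
```

With 50 short paths the batched form needs a single invocation, which is why it usually wins on local storage where process start-up dominates.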

What we need to remember is that with any of the find variants discussed here, it will succeed in the end, even if it does mean a few restarts by Docker. Each time it restarts it will have fewer changes to make on the next run.

That is a vast improvement over the original recursive chmod & chown system, where it constantly failed to start even if all the ownerships/permissions were correct.

I've also set the $STARTUP_DELAY environment variable to zero, which gives me an additional 30 seconds to get started.

d7415 commented

> it will succeed in the end

A very good point!

The :dev tag will be rebuilt with this as part of the scheduled run this evening.

Thanks for your help!