Fails to start with large number of files stored on GlusterFS
gareththered opened this issue · 8 comments
I have Lychee running on a Docker Swarm with GlusterFS as persistent storage for config and images.
Since installing, I've added approx. 3000 images. Unfortunately, Lychee now fails to start. Monitoring the logs shows that it stalls on "**** Set Permissions ****"; a couple of minutes later, Docker times out and restarts the service. This continues ad infinitum.
This is down to GlusterFS's slow performance with many files: it seems that running chown and chmod on each image takes too long, even if the ownership and permissions are correct in the first place (approx. 40 seconds on GlusterFS, while a local ZFS directory takes less than a second).
I've cloned entrypoint.sh and removed /uploads from the arguments of those two commands, then bind-mounted my copy over the original, and it now works.
My question therefore is: does it have to recursively set permissions on the /uploads directory each time it starts? Would it be better just to set the permissions non-recursively on the top-level directory if required, then (somehow) alert the user if the permissions are wrong further down?
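The "fix the top level, warn about the rest" idea in the question could look roughly like the sketch below. It is only an illustration: report_bad_perms is a hypothetical helper (not part of Lychee's entrypoint.sh), a temporary directory stands in for /uploads, and it assumes a find that understands symbolic -perm modes such as GNU find.

```shell
#!/bin/sh
# Sketch only: report_bad_perms is a hypothetical helper, not part of entrypoint.sh.
# It lists paths that fail the expected permission test without changing them.
report_bad_perms() {
    find "$1" \( ! -perm -ug+w -o ! -perm -ugo+rX \) -print
}

# Throwaway directory standing in for /uploads.
demo=$(mktemp -d)
mkdir "$demo/uploads"
touch "$demo/uploads/a.jpg" "$demo/uploads/b.jpg"
chmod 775 "$demo/uploads"
chmod 664 "$demo/uploads/a.jpg"
chmod 644 "$demo/uploads/b.jpg"   # group write missing, so this one is reported

# Fix only the top-level directory (a single write), then warn about the rest.
chmod ug+w,ugo+rX "$demo/uploads"
bad=$(report_bad_perms "$demo/uploads")
bad_count=$(printf '%s' "$bad" | grep -c .)
if [ "$bad_count" -gt 0 ]; then
    echo "WARNING: $bad_count path(s) have unexpected permissions:"
    echo "$bad"
fi
rm -rf "$demo"
```

The report step only reads metadata, which matters if, as seems to be the case here, reads are cheap on the storage backend and writes are not.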
I'm reluctant to change the behaviour for an edge case. The permissions check is useful for a few scenarios, including tools messing with storage mounted from elsewhere or simple migration from a non-Docker instance.
Changing the scripts like you have is certainly an option, taking responsibility for the permissions there.
To confirm, was that 40 seconds per image?
That sounds like a reasonable enough reason to leave something in. How about checking the ownership and permissions first, instead of blindly overwriting them even if they're correct? Something similar to:
find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} \;
This is really fast if the ownership and/or permissions are correct. It seems that GlusterFS reads fast, but writes slowly. If all the images have the wrong ownership/permissions then it will begin to fix them. If however, Docker believes the service hasn't started and restarts it, it will continue from where it left off until all the ownership/permissions are correct, at which point the service will start.
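To illustrate why the check-first approach is cheap once everything is correct, here is a small sketch run against a temporary directory rather than /uploads. fix_perms is a hypothetical helper that mirrors the chmod command above and additionally prints each path it changes; it assumes a find with symbolic -perm support (e.g. GNU find).

```shell
#!/bin/sh
# Sketch only: fix_perms is a hypothetical helper mirroring the chmod variant above.
# -print runs only after a successful -exec, so each output line is one path changed.
fix_perms() {
    find "$1" \( ! -perm -ug+w -o ! -perm -ugo+rX \) \
        -exec chmod ug+w,ugo+rX {} \; -print
}

demo=$(mktemp -d)
chmod 775 "$demo"
touch "$demo/1.jpg" "$demo/2.jpg" "$demo/3.jpg" "$demo/4.jpg" "$demo/5.jpg"
chmod 664 "$demo"/*.jpg
chmod 600 "$demo/4.jpg" "$demo/5.jpg"   # two files with wrong permissions

first_run=$(fix_perms "$demo" | wc -l | tr -d ' ')
second_run=$(fix_perms "$demo" | wc -l | tr -d ' ')
echo "changed on first run:  $first_run"
echo "changed on second run: $second_run"
rm -rf "$demo"
```

Only the two wrong files are written to on the first pass, and the second pass performs no writes at all.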
By the way, it was around 40 seconds to write ownership or permissions on all 3000 or so images - not one (thankfully!!)
I like the find idea in principle. I'll try to have a look and hopefully incorporate it tomorrow.
> By the way, it was around 40 seconds to write ownership or permissions on all 3000 or so images - not one (thankfully!!)
Phew!
Ok, so after some testing (with ~1000 small images on a local volume), find was quicker than running chmod/chown unconditionally when the permissions were already correct, but when they were wrong it was much slower: multiple seconds for each operation instead of hundredths of a second.
As a compromise, I tried running find over directories first (using chmod and chown recursively) and then checking for individual files afterwards. This gives a comparable worst case and, for the best case (all permissions correct), was still fractionally quicker than the old code in my tests.
I ended up with:
find /sym /uploads -type d \( ! -user "$USER" -o ! -group "$USER" \) -exec chown -R "$USER":"$USER" \{\} \;
find /sym /uploads -type d \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod -R ug+w,ugo+rX \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} \;
@gareththered Before I commit them, do these work for you?
Fwiw, my original thinking with this issue was just a flag to turn this off, so thank-you for this much better solution!
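A rough way to see why the directory-first pass helps the worst case is to count how many chmod processes each approach starts on a tree where everything is wrong. The sketch below is an assumption-laden demo, not the committed code: make_tree and count_lines are hypothetical helpers, the test is simplified to "! -perm -ug+w", and temporary directories stand in for /uploads.

```shell
#!/bin/sh
# Sketch only: make_tree and count_lines are hypothetical helpers for this demo.
make_tree() {
    mkdir -p "$1/album"
    i=1
    while [ "$i" -le 50 ]; do
        touch "$1/album/img$i.jpg"
        i=$((i + 1))
    done
    chmod -R u=rwX,go=rX "$1"   # dirs 755, files 644: group write missing everywhere
}
count_lines() { wc -l | tr -d ' '; }

per_file_tree=$(mktemp -d)
dir_first_tree=$(mktemp -d)
make_tree "$per_file_tree"
make_tree "$dir_first_tree"

# Per-file pass only: one chmod process for every wrong path.
per_file=$(find "$per_file_tree" ! -perm -ug+w \
    -exec chmod ug+w {} \; -print | count_lines)

# Directory pass first: chmod -R on the topmost wrong directory fixes everything
# beneath it before find descends, so the per-file pass has nothing left to do.
dir_pass=$(find "$dir_first_tree" -type d ! -perm -ug+w \
    -exec chmod -R ug+w {} \; -print | count_lines)
file_pass=$(find "$dir_first_tree" ! -perm -ug+w \
    -exec chmod ug+w {} \; -print | count_lines)

echo "per-file only: $per_file chmod calls"
echo "directory-first: $dir_pass recursive call(s), then $file_pass per-file calls"
rm -rf "$per_file_tree" "$dir_first_tree"
```

The per-file pass is still needed afterwards for the case where a directory looks fine but individual files inside it do not.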
I've carried out some basic testing by docker exec-ing into the running container.
FYI, I'm working with just over 9,000 files, not the 3,000 I initially said - I don't know where that figure came from! Of course, many of those 9,000 are the resized ones created on import by Lychee, so I have nowhere near that many images.
First, I set all ownership to be incorrect, which should take the same time as setting the correct ownership:
root@dc89912ecbd0:/# time chown -R 1001 uploads/
real 0m25.070s
user 0m0.085s
sys 0m2.021s
Then, set incorrect permissions. It seems to take a similar time to setting ownership.
root@dc89912ecbd0:/# time chmod -R g-w uploads/
real 0m24.225s
user 0m0.111s
sys 0m1.122s
Next, I timed the find commands you're proposing, which I've placed in a shell script so that I can time them:
echo "**** Set Permissions ****"
# Laravel needs to be able to chmod user.css for no good reason
find /sym /uploads -type d \( ! -user "$USER" -o ! -group "$USER" \) -exec chown -R "$USER":"$USER" \{\} \;
find /sym /uploads -type d \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod -R ug+w,ugo+rX \{\} \;
find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} \;
chown www-data:"$USER" /conf/user.css
usermod -a -G "$USER" www-data
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} \;
Running the above for the 1st time results in the following as it fixes all ownership and permissions:
root@dc89912ecbd0:/# time ./test.sh
**** Set Permissions ****
real 0m51.583s
user 0m0.232s
sys 0m3.520s
The above shows that it takes around the same time as running chown and chmod recursively.
Finally, I run it again, but this time every ownership and permission should be good already, so it runs much quicker:
root@dc89912ecbd0:/# time ./test.sh
**** Set Permissions ****
real 0m2.538s
user 0m0.078s
sys 0m0.349s
So, it looks very good from my point of view.
Thinking about your test, where fixing incorrect ownership/permissions was much slower on a local volume, I tested the -exec {} + variant (where it builds the exec'd command with a long argument list) but got rather disappointing results:
root@dc89912ecbd0:/# time ./test.sh
**** Set Permissions ****
real 1m39.615s
user 0m0.516s
sys 0m6.701s
Nearly double the time.
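For reference, the difference between the two -exec terminators can be seen by counting how many child processes find starts. This is only a sketch on a local temporary directory with 100 hypothetical files; it shows the batching behaviour and does not explain the slower GlusterFS timing above.

```shell
#!/bin/sh
# Sketch only: counts child processes started by each -exec form.
demo=$(mktemp -d)
i=1
while [ "$i" -le 100 ]; do
    touch "$demo/img$i.jpg"
    i=$((i + 1))
done

# "\;" form: one child process per matched file.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
# "+" form: as few child processes as the argument-length limit allows.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "per-file invocations: $per_file"
echo "batched invocations:  $batched"
rm -rf "$demo"
```

Fewer processes usually means less overhead on a local filesystem, which is why the result above on GlusterFS was unexpected.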
> So, it looks very good from my point of view.
🎉
> Nearly double the time.
Ah, that's annoying - I had been meaning to test that but forgot, so thanks again! Strange though...
I did a quick test of your original proposed find but with +:
find /conf/user.css /conf/.env /sym /uploads \( ! -user "$USER" -o ! -group "$USER" \) -exec chown "$USER":"$USER" \{\} +
find /conf/user.css /conf/.env /sym /uploads \( ! -perm -ug+w -o ! -perm -ugo+rX \) -exec chmod ug+w,ugo+rX \{\} +
and the results were very similar to the current leaders (no-find and -type d). But if that still breaks for your use, I guess the directory-first version is the best option.
What we need to remember is that with any of the find variants discussed here, it will succeed in the end, even if it does mean a few restarts by Docker. Each time it restarts it will have fewer changes to make on the next run.
That is a vast improvement over the original recursive chmod & chown system, where it constantly failed to start even if all the ownerships/permissions were correct.
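The "continues where it left off" behaviour can be sketched by simulating a start-up that is killed part-way through. Everything here is an assumption for illustration: list_wrong is a hypothetical helper, the test is simplified to "! -perm -ug+w", and ten files in a temporary directory stand in for the image store.

```shell
#!/bin/sh
# Sketch only: list_wrong is a hypothetical helper for this demo.
list_wrong() { find "$1" -type f ! -perm -ug+w -print; }

demo=$(mktemp -d)
i=1
while [ "$i" -le 10 ]; do
    touch "$demo/img$i.jpg"
    chmod 644 "$demo/img$i.jpg"   # group write missing on all ten files
    i=$((i + 1))
done

# Simulate a start-up that is killed after fixing only four files.
list_wrong "$demo" | head -n 4 | while IFS= read -r f; do
    chmod ug+w "$f"
done

# The next start-up only has the remainder to deal with.
remaining=$(list_wrong "$demo" | wc -l | tr -d ' ')
echo "left for the next run: $remaining"
rm -rf "$demo"
```

Because fixed files no longer match the test, each restart has strictly less work to do than the one before.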
I've also set the $STARTUP_DELAY environment variable to zero, which gives me an additional 30 seconds to get started.
> it will succeed in the end
A very good point!
The :dev tag will be rebuilt with this as part of the scheduled run this evening.
Thanks for your help!