Mirrorbrain: stop scanning dirs too many times
benoit74 commented
I dug a bit into Mirrorbrain and found something rather surprising in our custom script which launches mirror scanning operations.
In order to start scanning, we first get a list of directories to scan, with three types:
- ALLDIRS for mirrors which have all files
- ZIMDIRS for mirrors which have only ZIMs
- WMDIRS for mirrors which have only wikimedia ZIMs
And then, for each mirror and each directory found, we call mb scan -d <dir> <mirror>
The problem is that:
- mb scan is recursing the scan, i.e. if we pass zim as a directory parameter, it will scan the zim directory and all its subdirectories
- ALLDIRS and ZIMDIRS contain the whole tree hierarchy, e.g. [ 'zim', 'zim/ifixit', ..., 'release', 'release/browsers', 'release/browsers/chrome', ...]
- WMDIRS is probably not impacted because we look for specific subfolders
This means that each file is scanned once for every level of the hierarchy above it that appears in the list. Most files are scanned at least twice (e.g. a file in zim/ifixit will be scanned once for zim and once for zim/ifixit). But some files are deeply nested and scanned many more times.
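To make the multiplication concrete, here is a small Python sketch that counts how many recursive scans would cover a given file. It is only an illustration: the `scan_count` helper and the file names are hypothetical, and the directory list just mirrors the shape of the example above, not the real content of ALLDIRS.

```python
from pathlib import PurePosixPath


def scan_count(file_path: str, scan_dirs: list[str]) -> int:
    """Count how many recursive scans of scan_dirs would cover file_path."""
    # Every ancestor directory of the file, e.g. {"zim/ifixit", "zim", "."}
    ancestors = {str(p) for p in PurePosixPath(file_path).parents}
    # A recursive scan of a directory covers the file if that directory
    # is one of the file's ancestors.
    return sum(1 for d in scan_dirs if d in ancestors)


# Illustrative list only, shaped like the hierarchy described above
dirs = ["zim", "zim/ifixit", "release", "release/browsers", "release/browsers/chrome"]

# Hypothetical file names, used only for the illustration
print(scan_count("zim/ifixit/example.zim", dirs))               # 2
print(scan_count("release/browsers/chrome/example.zip", dirs))  # 3
```
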
I assume this is not what was intended, WDYT?
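If this is indeed unintended, one possible direction would be to prune the directory list so that only directories without an ancestor in the same list are kept, relying on the recursion of mb scan to cover the rest. Below is a minimal Python sketch of that idea; `prune_nested` is a hypothetical helper, not something that exists in our script or in Mirrorbrain, and it assumes mb scan really does recurse as described above.

```python
from pathlib import PurePosixPath


def prune_nested(dirs: list[str]) -> list[str]:
    """Keep only the directories that have no ancestor in the same list."""
    # Normalise the paths (e.g. drop trailing slashes) and remove duplicates
    wanted = {str(PurePosixPath(d)) for d in dirs}
    roots = []
    for d in sorted(wanted):
        ancestors = {str(p) for p in PurePosixPath(d).parents}
        # If an ancestor is already in the list, a recursive scan of that
        # ancestor covers this directory as well, so it can be skipped.
        if not (ancestors & wanted):
            roots.append(d)
    return roots


# Illustrative list only, shaped like the hierarchy described above
dirs = ["zim", "zim/ifixit", "release", "release/browsers", "release/browsers/chrome"]
print(prune_nested(dirs))  # ['release', 'zim']
```

Each remaining directory would then be passed to mb scan -d <dir> <mirror> exactly as today, so every file would be covered by a single scan.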