kiwix/container-images

Mirrorbrain: stop scanning dirs too many times

Closed this issue · 1 comments

I dug a bit into Mirrorbrain and found something rather surprising in our custom script which launches mirror scanning operations.

In order to start scanning, we first build a list of directories to scan, of three types:

  • ALLDIRS for mirrors which have all files
  • ZIMDIRS for mirrors which have only ZIMs
  • WMDIRS for mirrors which have only Wikimedia ZIMs

And then, for each mirror and each directory found, we call mb scan -d <dir> <mirror>
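
For reference, the loop described above amounts to something like the sketch below. It is in Python for illustration only; the real script may be written differently, and `build_scan_commands` and the mirror name `mirror-a` are hypothetical. Only the `mb scan -d <dir> <mirror>` form comes from the script itself.

```python
from itertools import product


def build_scan_commands(mirrors, dirs):
    """Return one `mb scan -d <dir> <mirror>` argv list per (mirror, dir) pair."""
    return [["mb", "scan", "-d", d, m] for m, d in product(mirrors, dirs)]


# Hypothetical mirror name; the directories are taken from the example in this issue.
commands = build_scan_commands(["mirror-a"], ["zim", "zim/ifixit"])
for cmd in commands:
    print(" ".join(cmd))
```

The commands are only built and printed here, not executed, so this can be run without Mirrorbrain installed.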

The problem is that:

  • mb scan is recursive, i.e. if we pass zim as the directory parameter, it scans the zim directory and all its subdirectories
  • ALLDIRS and ZIMDIRS contain the whole tree hierarchy
    • e.g. [ 'zim', 'zim/ifixit', ..., 'release', 'release/browsers', 'release/browsers/chrome', ...]
    • WMDIRS is probably not impacted because we look for specific subfolders

This means that each file is scanned once for every listed directory that contains it, so the deeper a file is nested, the more times it is scanned. Most files are scanned at least twice (e.g. a file in zim/ifixit will be scanned once for zim and once for zim/ifixit). But some files are deeply nested and scanned many more times.
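
To make this concrete, here is a minimal sketch (Python, not taken from the actual script) that counts how many listed directories cover a given file's directory, plus one possible fix: keep only the directories that have no ancestor in the list. The function names `scan_count` and `prune_nested` are hypothetical, and the directory list only uses the entries named above, without the elided ones.

```python
from pathlib import PurePosixPath


def scan_count(file_dir, dirs):
    """Count the listed directories whose recursive scan covers file_dir."""
    target = PurePosixPath(file_dir)
    return sum(
        1
        for d in dirs
        if PurePosixPath(d) == target or PurePosixPath(d) in target.parents
    )


def prune_nested(dirs):
    """Keep only directories that have no ancestor in the list (order preserved)."""
    listed = {PurePosixPath(d) for d in dirs}
    kept = [d for d in dirs if not any(p in listed for p in PurePosixPath(d).parents)]
    return list(dict.fromkeys(kept))  # drop exact duplicates, keep first occurrence


dirs = ["zim", "zim/ifixit", "release", "release/browsers", "release/browsers/chrome"]

print(scan_count("zim/ifixit", dirs))               # 2
print(scan_count("release/browsers/chrome", dirs))  # 3
print(prune_nested(dirs))                           # ['zim', 'release']
```

With the pruned list, a recursive scan still reaches every file, but each file is covered by exactly one listed directory. Whether pruning the list is the right fix (as opposed to, say, changing how the list is generated) is an assumption to validate against the script.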

I assume this is not what was intended, WDYT?

@benoit74 I wrote this a long time ago. I don't remember the details, but this would not be too surprising to me.

The number of files and mirrors was very different 10 years ago.

It would not be a waste of time to investigate and make the necessary fixes to speed this up.