Depth support for category fetching
Closed this issue · 0 comments
wetneb commented
We want to support fetching subcategories recursively up to some depth, as other tools such as PetScan do.
Here is a proposed architecture for this.
/**
 * Fetches a category recursively, up to the given depth, from the MediaWiki API.
 * The stream of FileRecords contains the filenames and mids, but not the related
 * categories (which must be fetched separately).
 * Set the depth to 0 to ignore subcategories.
 */
static Iterator<FileRecord> listCategoryMembers(String endpoint, String categoryName, int depth) {
    // TODO
}

/**
 * Fetches the direct subcategories of a given category, from the MediaWiki API.
 * The supplied stream contains category names (TBD: with or without the `Category:` prefix?).
 */
static Iterator<String> fetchSubcategories(String endpoint, String categoryName) {
    // TODO
}

/**
 * Fetches the files which are direct members of a given category, from the MediaWiki API.
 * The stream of FileRecords contains the filenames and mids, but not the related
 * categories (which must be fetched separately).
 */
static Iterator<FileRecord> fetchDirectFileMembers(String endpoint, String categoryName) {
    // TODO
}

/**
 * Internal function used to iterate over the paginated results of the MediaWiki API
 * when fetching files or categories. This function is used both by fetchSubcategories and
 * by fetchDirectFileMembers.
 * The `subcategories` parameter can be set to true to fetch categories and false to fetch files.
 */
static Iterator<JsonNode> fetchCategoryMembers(String endpoint, String categoryName, boolean subcategories) {
    // TODO
}
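To illustrate the pagination that `fetchCategoryMembers` is meant to hide, here is a minimal sketch. It assumes each API response carries an optional continuation token, as `cmcontinue` does for `list=categorymembers`. The `Page` holder, the `paginate` helper and the `fetchPage` callback are hypothetical names used only for illustration, not part of the existing code, and the HTTP call is replaced by an in-memory stand-in.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

class PagedIteratorSketch {

    /** Hypothetical holder: one page of results plus the token for the next page (null on the last page). */
    static final class Page<T> {
        final List<T> items;
        final String continueToken;

        Page(List<T> items, String continueToken) {
            this.items = items;
            this.continueToken = continueToken;
        }
    }

    /**
     * Turns a "fetch one page" callback into a lazy iterator over all items.
     * The callback receives null for the first page, then the previous page's token.
     */
    static <T> Iterator<T> paginate(Function<String, Page<T>> fetchPage) {
        return new Iterator<T>() {
            private final Deque<T> buffer = new ArrayDeque<>();
            private String token = null;       // null requests the first page
            private boolean exhausted = false; // true once the last page has been fetched

            @Override
            public boolean hasNext() {
                // Loop because a page may be empty while more pages remain.
                while (buffer.isEmpty() && !exhausted) {
                    Page<T> page = fetchPage.apply(token);
                    buffer.addAll(page.items);
                    token = page.continueToken;
                    exhausted = (token == null);
                }
                return !buffer.isEmpty();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return buffer.removeFirst();
            }
        };
    }

    public static void main(String[] args) {
        // In-memory stand-in for two API pages; a real callback would issue an HTTP request.
        Iterator<String> it = PagedIteratorSketch.<String>paginate(tok ->
            tok == null
                ? new Page<String>(List.of("File:A.jpg", "File:B.jpg"), "page2")
                : new Page<String>(List.of("File:C.jpg"), null));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Pages are only requested when the buffer runs dry, so a caller that stops iterating early does not trigger further requests.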
To migrate to this architecture, I propose the following steps:
- the current `FileFetcher` class is adapted to implement `Iterator<JsonNode>` instead of `Iterator<FileRecord>`: it is no longer responsible for parsing each JSON result into a FileRecord. Furthermore, the `FileFetcher` constructor takes a new boolean parameter indicating whether it should fetch files or subcategories (it cannot do both).
- the static method `fetchCategoryMembers` is a simple wrapper on top of `FileFetcher`
- the removed parsing code is moved into `fetchDirectFileMembers`, which converts the `Iterator<JsonNode>` to an `Iterator<FileRecord>` by parsing each result
- similarly, `fetchSubcategories` parses each result, but extracts only the category names, without pageids
- finally, the `listCategoryMembers` method combines `fetchSubcategories` and `fetchDirectFileMembers` in a recursive algorithm which traverses categories up to a certain depth.
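As a rough illustration of the last step, the sketch below walks a category tree up to a given depth. It is only a sketch under stated assumptions: the two fetch methods are in-memory stand-ins for the API calls, file names are plain strings rather than `FileRecord`s, and the walk is written level by level (breadth-first) instead of as literal recursion. The level-by-level form is a deliberate swap: with a visited set guarding against category cycles, it ensures each category is first reached at its shallowest depth. The class name and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CategoryWalkSketch {

    // Hypothetical sample data standing in for MediaWiki API responses.
    // Note the cycle: "Cats" lists "Animals" as a subcategory.
    static final Map<String, List<String>> SUBCATEGORIES = Map.of(
        "Animals", List.of("Cats", "Dogs"),
        "Cats", List.of("Kittens", "Animals"));

    static final Map<String, List<String>> FILES = Map.of(
        "Animals", List.of("Zoo.jpg"),
        "Cats", List.of("Cat.jpg"),
        "Dogs", List.of("Dog.jpg"),
        "Kittens", List.of("Kitten.jpg"));

    // Stand-in for fetchSubcategories(endpoint, categoryName).
    static List<String> fetchSubcategories(String category) {
        return SUBCATEGORIES.getOrDefault(category, List.of());
    }

    // Stand-in for fetchDirectFileMembers(endpoint, categoryName).
    static List<String> fetchDirectFileMembers(String category) {
        return FILES.getOrDefault(category, List.of());
    }

    /** Collects files from the category and its subcategories, up to `depth` levels down. */
    static List<String> listCategoryMembers(String category, int depth) {
        Set<String> visited = new HashSet<>();      // categories already queued, to survive cycles
        Set<String> files = new LinkedHashSet<>();  // de-duplicates files found in several categories
        List<String> level = List.of(category);
        visited.add(category);

        for (int d = 0; d <= depth && !level.isEmpty(); d++) {
            List<String> next = new ArrayList<>();
            for (String current : level) {
                files.addAll(fetchDirectFileMembers(current));
                if (d < depth) {
                    // Only descend while the depth budget allows it.
                    for (String sub : fetchSubcategories(current)) {
                        if (visited.add(sub)) {
                            next.add(sub);
                        }
                    }
                }
            }
            level = next;
        }
        return new ArrayList<>(files);
    }

    public static void main(String[] args) {
        System.out.println(listCategoryMembers("Animals", 0));
        System.out.println(listCategoryMembers("Animals", 2));
    }
}
```

With depth 0 only the direct files of the root category are returned, matching the Javadoc of `listCategoryMembers`. A real implementation would return a lazy `Iterator<FileRecord>` rather than building a list up front.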