OpenRefine/CommonsExtension

Depth support for category fetching


We want to support fetching subcategories recursively up to a given depth, as other tools such as PetScan do.

Here is a proposed architecture for this.

/**
 * Fetches a category recursively, up to the given depth, from the MediaWiki API.
 * The stream of FileRecords contains the filenames and mids, but not the related
 * categories (which must be fetched separately).
 * Set the depth to 0 to ignore subcategories.
 */
static Iterator<FileRecord> listCategoryMembers(String endpoint, String categoryName, int depth) {
    // TODO
}

/**
 * Fetches the direct subcategories of a given category, from the MediaWiki API.
 * The supplied stream contains category names (TBD: with or without the `Category:` prefix?).
 */
static Iterator<String> fetchSubcategories(String endpoint, String categoryName) {
    // TODO
}

/**
 * Fetches the files which are direct members of a given category, from the MediaWiki API.
 * The stream of FileRecords contains the filenames and mids, but not the related
 * categories (which must be fetched separately).
 */
static Iterator<FileRecord> fetchDirectFileMembers(String endpoint, String categoryName) {
    // TODO
}

/**
 * Internal function used to iterate over the paginated results of the MediaWiki API
 * when fetching files or categories. This function is used both by fetchSubcategories and
 * by fetchDirectFileMembers.
 * Set the `subcategories` parameter to true to fetch subcategories, or to false to fetch files.
 */
static Iterator<JsonNode> fetchCategoryMembers(String endpoint, String categoryName, boolean subcategories) {
    // TODO
}
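
As a rough illustration of how the `subcategories` flag could map onto a request, here is a minimal sketch that builds a `list=categorymembers` query, using `cmtype=subcat` or `cmtype=file` depending on the flag and passing the `cmcontinue` token for pagination. The class name, the `buildQuery` helper, its `continueToken` parameter and the `cmlimit` value are assumptions made for this example; the existing FileFetcher may well build its requests differently.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CategoryMembersQuerySketch {

    /**
     * Builds the URL for one page of category members (hypothetical helper).
     *
     * @param endpoint       URL of the MediaWiki API, e.g. ending in /w/api.php
     * @param categoryName   category title, including the "Category:" prefix
     *                       (the cmtitle parameter requires it)
     * @param subcategories  true to list subcategories, false to list files
     * @param continueToken  cmcontinue value from the previous response,
     *                       or null when requesting the first page
     */
    static String buildQuery(String endpoint, String categoryName,
                             boolean subcategories, String continueToken) {
        StringBuilder url = new StringBuilder(endpoint);
        url.append("?action=query&list=categorymembers&format=json");
        url.append("&cmtitle=").append(encode(categoryName));
        // A single request here asks for one type only, mirroring the boolean flag.
        url.append("&cmtype=").append(subcategories ? "subcat" : "file");
        // Assumed page size for this sketch.
        url.append("&cmlimit=500");
        if (continueToken != null) {
            url.append("&cmcontinue=").append(encode(continueToken));
        }
        return url.toString();
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

With this shape, a paginating iterator would call `buildQuery` with a null token first, then repeat with the `cmcontinue` value found in each response until the response no longer contains one.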

To migrate to this architecture, I propose the following steps:

  • the current FileFetcher class is adapted to implement Iterator<JsonNode> instead of Iterator<FileRecord>: it is no longer responsible for parsing each JSON result into a FileRecord. Furthermore, the FileFetcher constructor takes a new boolean parameter indicating whether it should fetch files or subcategories (it cannot do both).
  • the static method fetchCategoryMembers is a simple wrapper on top of FileFetcher
  • the removed parsing code is moved into fetchDirectFileMembers, which converts the Iterator<JsonNode> to an Iterator<FileRecord> by parsing each result
  • similarly, fetchSubcategories parses each result, but extracts only the category names, not the page ids
  • finally, the listCategoryMembers method combines fetchSubcategories and fetchDirectFileMembers in a recursive algorithm which traverses subcategories up to the given depth.
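
The last step could look roughly like the sketch below. It is only an illustration under several assumptions: the two fetchers are passed in as functions (`subcategoriesOf` and `filesOf`, hypothetical stand-ins for fetchSubcategories and fetchDirectFileMembers with the endpoint already bound), the traversal is written as a lazy breadth-first loop rather than literal recursion, and a `seen` set is added so that cycles in the category graph do not cause endless fetching. None of these choices is prescribed by the proposal above.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.function.Function;

public class CategoryTraversalSketch {

    /**
     * Lazily lists the files of a category and of its subcategories, up to the given depth.
     * T stands in for FileRecord; both function parameters are hypothetical stand-ins
     * for the proposed fetchSubcategories and fetchDirectFileMembers methods.
     */
    static <T> Iterator<T> listCategoryMembers(
            String rootCategory,
            int depth,
            Function<String, Iterator<String>> subcategoriesOf,
            Function<String, Iterator<T>> filesOf) {
        return new Iterator<T>() {
            // Categories still to visit, each paired with its remaining depth.
            private final Deque<Map.Entry<String, Integer>> pending = new ArrayDeque<>();
            // Categories already queued, so that each one is fetched at most once.
            private final Set<String> seen = new HashSet<>();
            // Files of the category currently being read.
            private Iterator<T> current = Collections.emptyIterator();

            {
                pending.add(Map.entry(rootCategory, depth));
                seen.add(rootCategory);
            }

            @Override
            public boolean hasNext() {
                // Move on to the next category only once the current one is exhausted.
                while (!current.hasNext() && !pending.isEmpty()) {
                    Map.Entry<String, Integer> entry = pending.poll();
                    String category = entry.getKey();
                    int remaining = entry.getValue();
                    if (remaining > 0) {
                        Iterator<String> subcategories = subcategoriesOf.apply(category);
                        while (subcategories.hasNext()) {
                            String subcategory = subcategories.next();
                            if (seen.add(subcategory)) {
                                pending.add(Map.entry(subcategory, remaining - 1));
                            }
                        }
                    }
                    current = filesOf.apply(category);
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}
```

Because the queue is processed in breadth-first order, a category reachable through several paths is visited at its shallowest depth. The sketch does not deduplicate files, so a file that belongs to several visited categories would be returned once per category; whether that is desirable is left open here.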