barrust/mediawiki

Combine property pulls into single requests

barrust opened this issue · 3 comments

To reduce the load on MediaWiki servers, it would be good to combine as many of the property requests as possible. Anything that can be pulled in a single request should be.

Some possible pitfalls:

  • Figuring out how to properly use the continue parameter when multiple elements are being returned
  • Determining which properties should be combined into a single MediaWiki API request
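
For the first pitfall, here is a minimal sketch of how continuation could work when several properties share one request. It relies on the MediaWiki API returning a `continue` object whose keys are all sent back with the next request. `continued_query` and `fetch` are hypothetical names; `fetch` stands in for whatever performs the HTTP call and returns decoded JSON.

```python
def continued_query(fetch, params):
    """Yield every response batch for a query, following `continue`.

    `fetch` is a hypothetical callable: it takes a params dict and
    returns the decoded JSON response as a dict.
    """
    request = dict(params)
    while True:
        response = fetch(request)
        yield response
        if "continue" not in response:
            break
        # Send back every key from `continue` (e.g. clcontinue, plcontinue),
        # not just one, so each property resumes where it stopped.
        request = dict(params)
        request.update(response["continue"])
```

Each batch may hold only part of each property's data, so the caller still has to merge batches before parsing.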

The best possible outcome would be to pull some of the properties in the same request that fetches the main page information. Doing this will require quite a bit of rework, but I think it would be a great addition and would reduce the number of calls against the MediaWiki site.

@barrust, I've prepared an MVP of a possible solution. You can check it here.
I would like to hear your opinion.

To keep the code clearer, I've decided to create Python descriptors in a separate module instead of expanding MediaWikiPage.

MediaWikiPageProperty
I've implemented MediaWikiPageProperty, which is the base class for all future page properties such as content, categories, etc.
Every subclass of MediaWikiPageProperty has two functions:

  • get_query_params — returns default query_params
  • parse_query_data — gets required value from response
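
To illustrate the idea, here is a rough sketch of what such a descriptor could look like. It is not the code from the linked MVP. Only the names `MediaWikiPageProperty`, `get_query_params` and `parse_query_data` come from the description above; the caching, the `page.query(...)` method, passing `page` into `get_query_params`, and the `Categories` subclass are assumptions made for the example.

```python
class MediaWikiPageProperty:
    """Hypothetical descriptor base class for lazily loaded page properties."""

    def __set_name__(self, owner, name):
        # Cache the parsed value on the page under a private attribute name.
        self._cache_name = "_" + name

    def __get__(self, page, owner=None):
        if page is None:
            return self
        if not hasattr(page, self._cache_name):
            # Assumption: the page object exposes query(params) -> dict.
            data = page.query(self.get_query_params(page))
            setattr(page, self._cache_name, self.parse_query_data(data))
        return getattr(page, self._cache_name)

    def get_query_params(self, page):
        raise NotImplementedError

    def parse_query_data(self, data):
        raise NotImplementedError


class Categories(MediaWikiPageProperty):
    """Example subclass: category titles of a page."""

    def get_query_params(self, page):
        return {"prop": "categories", "cllimit": "max", "titles": page.title}

    def parse_query_data(self, data):
        pages = data["query"]["pages"]
        return [c["title"] for p in pages.values() for c in p.get("categories", [])]
```

A page class would then declare `categories = Categories()` as a class attribute, and the first access to `page.categories` would trigger the query.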

MediaWikiPagePropertyHandler
The changes above made it possible to implement the function get_batch_properties, which can combine query_params and decrease the overall number of queries.

combine_query_params is responsible for combining queries.
This function follows these rules:

  • According to the MediaWiki API, only one generator is allowed per request.
  • The function avoids combining properties that use the same prop value into a single request. E.g. Content and Summary will be kept separate because both contain prop=extracts.
    Exception: parameters that specify pages, such as titles, pageids and revids.
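
The rules above can be sketched roughly as follows. This is an illustration under assumptions, not the MVP's implementation: the greedy grouping, the helpers `_props`, `can_combine` and `merge`, and the requirement that every shared key other than `prop` carries the same value (which covers titles, pageids and revids) are all choices made for the example.

```python
def _props(params):
    """Return the set of prop modules named in a params dict."""
    return set(filter(None, params.get("prop", "").split("|")))


def can_combine(a, b):
    # Rule 1: only one generator per request.
    if "generator" in a and "generator" in b:
        return False
    # Rule 2: the same prop module must not appear twice (e.g. extracts).
    if _props(a) & _props(b):
        return False
    # Exception: shared keys such as titles/pageids/revids are fine,
    # as long as both sides ask for the same value.
    shared = (set(a) & set(b)) - {"prop"}
    return all(a[key] == b[key] for key in shared)


def merge(a, b):
    merged = {**a, **b}
    props = sorted(_props(a) | _props(b))
    if props:
        merged["prop"] = "|".join(props)  # the API joins values with "|"
    return merged


def combine_query_params(param_sets):
    """Greedily pack params dicts into as few requests as the rules allow."""
    batches = []
    for params in param_sets:
        for i, batch in enumerate(batches):
            if can_combine(batch, params):
                batches[i] = merge(batch, params)
                break
        else:
            batches.append(dict(params))
    return batches
```

With hypothetical params for content, summary and categories of the same page, content and categories would end up in one request while summary stays in its own, because it also uses prop=extracts.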

@shnela this is very interesting and a very different approach from what I was thinking. I was imagining a "simpler" (in my mind) solution: just cherry-picking the properties that are generally pulled together and merging those. It would require a slightly different version of the _continued_query method that merges the results into a single dictionary object before parsing.

Something like joining all the pre-populated properties together, except for generators, which can't be used together in one request.
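
As a rough sketch of the merging step, assuming each response batch has the usual `query.pages` mapping keyed by page id and that list-valued fields (categories, links, ...) should be concatenated across batches. `merge_batches` is a hypothetical helper name, not an existing method in the library.

```python
def merge_batches(batches):
    """Merge per-page data from several response batches into one dict."""
    merged = {}
    for batch in batches:
        pages = batch.get("query", {}).get("pages", {})
        for page_id, page in pages.items():
            target = merged.setdefault(page_id, {})
            for key, value in page.items():
                if isinstance(value, list):
                    # Lists arrive in pieces across batches: concatenate them.
                    target.setdefault(key, []).extend(value)
                else:
                    # Scalars (title, pageid, ...) repeat: keep the latest.
                    target[key] = value
    return merged
```

Each property could then be parsed from the merged dictionary without knowing how many requests it took to collect.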

Thoughts?