MassUploadLibrary

Managing a mass-upload to Wikimedia Commons.

Overview

This library provides everything needed (hopefully) to mass-upload media content and its associated metadata to Wikimedia Commons.

Its philosophy is to provide a nearly fully featured codebase, with entry points to tailor the behaviour to each specific case.

Features include:

  • Attachment of metadata post-processors per metadata field
  • Mapping metadata with a wiki-based alignment
  • Use of a Data ingestion template

Dependencies

The actual upload is performed using Pywikibot (in its core version), specifically its upload.py and data_ingestion.py scripts.

Usage

This library by itself does not do much. See the TrutatBis project for a minimal implementation.

Using it basically consists of overriding some methods and passing in new ones, as sketched below.
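Purely as an illustration of that overriding pattern − the class and method names below are hypothetical, not the library's actual API:

class BaseMetadataReader:
    """Stand-in for a reader class the library would provide."""
    def process_record(self, record):
        return record

class MyProjectReader(BaseMetadataReader):
    """Project-specific subclass overriding one hook method."""
    def process_record(self, record):
        # Institution-specific clean-up goes here.
        record['Title'] = record.get('Title', '').strip()
        return record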

Pre-processing

Pre-processing is triggered through the --prepare-alignment CLI parameter.

It consists of:

  • indexing all the metadata, per field, counting each field value;
  • outputting it as a Template-based wikitable − see 1 or 2.

This wikitable is to be used by volunteers to match the institution metadata to Wikimedia Commons metadata − either as values for the (typically) {{Artwork}} template, or as categories to be added.
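The indexing step itself boils down to counting distinct values per field. A minimal, self-contained sketch (the records are illustrative):

from collections import Counter

# Illustrative records; real ones come from the institution's metadata.
records = [
    {'Technique': 'gelatin silver print', 'Support': 'glass'},
    {'Technique': 'gelatin silver print', 'Support': 'paper'},
]
index = {}
for record in records:
    for field, value in record.items():
        index.setdefault(field, Counter())[value] += 1
# index['Support'] is now Counter({'glass': 1, 'paper': 1})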

Post-processing

Post-processing is triggered through the --post-process CLI parameter.

In this step, we associate (through a dictionary) each field with a post-processing method.

For example, we can associate the Date field with a method that parses the date to fill out an {{Other date}} template.
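Such a processor could look something like this − a hedged sketch, where the function name and the exact date formats handled are illustrative, not what PostProcessing.py actually ships:

import re

def parse_date_to_other_date(value):
    """Wrap a 'circa YYYY' date into an {{Other date}} template call."""
    match = re.match(r'(?:ca\.?|circa)\s*(\d{4})', value, re.IGNORECASE)
    if match:
        return '{{Other date|ca|%s}}' % match.group(1)
    return value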

The PostProcessing.py module holds a bunch of useful processors, and is expected to grow over time. Very specific processors do not have to be integrated into the library.

Alignments

A specific processor makes use of the alignment performed in the previous step.

The process is the following:

  • Getting the wikitables − now holding all the mappings thanks to awesome volunteers − back onto the local disk.
  • Specifying which fields are to be retrieved.
  • Retrieving the fields.
  • Instructing, in the mapping dictionary, that such fields are to be processed with the alignment.

Here is an example of the code from the TrutatBis project:

mapping_fields = ['Type de document', 'Support', 'Technique', 'Auteur']
# Retrieve the volunteer-made alignments for these fields from the wiki.
mapper = commonprocessors.retrieve_metadata_alignments(mapping_fields, alignment_template)
mapping_methods = {
    # Field-specific processors.
    'Format': (processors.parse_format, {}),
    'Analyse': (processors.look_for_date, {}),
    # Alignment-based processors, all sharing the same mapper.
    'Auteur': (commonprocessors.process_with_alignment, {'mapper': mapper}),
    'Support': (commonprocessors.process_with_alignment, {'mapper': mapper}),
    'Technique': (commonprocessors.process_with_alignment, {'mapper': mapper}),
}
reader = collection.post_process_collection(mapping_methods)
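Conceptually, an alignment-based processor is little more than a lookup: the raw field value is replaced by whatever the volunteers mapped it to on-wiki. A rough sketch of the idea (not the library's actual implementation):

def process_with_alignment(field, value, mapper):
    # Fall back to the raw value when no alignment exists for it yet.
    return mapper.get(field, {}).get(value, value)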

Dry-run

Dry-run is triggered through the --dry-run CLI parameter.

It outputs the transformed metadata in Wikimedia Commons format, which is useful either for peer review at the batch uploading coordination page, or to be sent to the GLAM if they really want to upload the files themselves through the good old upload form.
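For a single file, the dry-run output might look something like the following (field names and values purely illustrative):

{{CoolInstitutionIngestionTemplate
|Auteur=Some Author
|Date={{Other date|ca|1900}}
|Support=glass
}}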

Upload

Upload is triggered through the --upload CLI parameter.

What it says on the tin: upload to Wikimedia Commons using Pywikibot.

Data ingestion template

In this process, metadata is formatted as a MediaWiki template:

{{CoolInstitutionIngestionTemplate
|key1=value1
|key2=value2
}}

Where {{CoolInstitutionIngestionTemplate}} is a cool Data ingestion template.

  • It can be made to use {{Artwork}}, {{Photograph}}, or whatever fits best.
  • It maps the various institution fields to our fields.
  • It is to be subst-ed recursively at upload time.

The nice thing about it is that people do not need to get their hands into the code to help out: they can just edit the template.

Data ingestion templates can be made arbitrarily complex, with crazy {{#if}} parser functions triggering categorisation or whatever. Some problems are more easily solved this way than by implementing them in Python.
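For instance, a hypothetical template snippet could add a category only when a given field is filled in:

{{#if: {{{Support|}}}
| [[Category:Photographs on {{{Support}}} support]]
}}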

Basic categorisation statistics

As we saw earlier, categories may be added as part of the post-processing (mainly through alignment). Every addition is tracked, both per category and per file.

A method allows computing basic measures on this:

  • Per category:
    • Total number of categories added, and of distinct ones
    • Most used, least used, mean and median
  • Per file:
    • Most and least categorised files
    • Number of uncategorised files
    • Number and percentage of files with less than N categories
    • Average and median categorisation

(Note that some of the mean and median computations need numpy.)
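As an illustration of the per-file measures − the categories_per_file structure here is hypothetical, standing in for the tracking described above:

import numpy

# Hypothetical per-file tracking: file name -> categories added to it.
categories_per_file = {
    'File:A.jpg': ['Cat1', 'Cat2'],
    'File:B.jpg': [],
    'File:C.jpg': ['Cat1'],
}
counts = numpy.array([len(cats) for cats in categories_per_file.values()])
print("Uncategorised files:", int((counts == 0).sum()))
print("Mean categorisation:", counts.mean())
print("Median categorisation:", numpy.median(counts))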

Installation

The easiest way to install should be to use pip with git:

pip install git+git://github.com/Commonists/MassUploadLibrary.git#egg=uploadlibrary

But this usually fails when pip cannot resolve the Pywikibot dependency.

Alternatively, you can clone the repository and install it using setuptools:

python setup.py install

Note that the Pywikibot dependency is sometimes tricky to resolve automatically. If it fails, consider installing Pywikibot manually.

License

MIT license