/bio-index

A set of scripts used to import data from the Polish Wikipedia index of biographies to Wikidata, and to generate and manage said index.


Synopsis

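This is a set of scripts used for importing data from the Polish Wikipedia index of biographies into Wikidata, and for generating and managing that index.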

Written in Ruby and JavaScript.

License

The MIT License, partially dual-licensed under CC BY-SA to allow certain files to be freely pasted on pages of Wikimedia projects.

For details and a list of contributors, see LICENSE.

Libraries

A whole lot. Apart from the standard Ruby library some of the scripts require the following gems (in latest available versions as of 2013-09-27):

  • roman
  • json
  • nokogiri
  • parallel
  • sunflower
  • unicode_utils
  • unidecoder

The code has only been tested on Ruby 1.9.3. It will probably run on newer Rubies, too.
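Before running any of the scripts, it may help to check which of the gems listed above are actually installed. The following is only a sketch: the `REQUIRED_GEMS` constant and the `missing_gems` helper are hypothetical names and are not part of this repository. It relies on `Gem::Specification.find_by_name`, which raises a `Gem::LoadError` (or a subclass of it) when no matching gem is found.

```ruby
# Gem names copied from the list above; the helper itself is illustrative
# and does not exist in the repository.
REQUIRED_GEMS = %w[roman json nokogiri parallel sunflower unicode_utils unidecoder].freeze

# Returns the subset of +names+ for which RubyGems finds no installed version.
def missing_gems(names)
  names.reject do |name|
    begin
      Gem::Specification.find_by_name(name)
      true   # found, so it is not missing
    rescue Gem::LoadError
      false  # not found, keep it in the result
    end
  end
end

missing = missing_gems(REQUIRED_GEMS)
if missing.empty?
  puts "All listed gems are installed."
else
  puts "Missing gems: #{missing.join(', ')}"
end
```

The explicit `begin`/`rescue` inside the block is deliberate: the shorter form without `begin` is a syntax error on older Rubies such as 1.9.3.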

Details

Most of the text (in Polish) and configuration (for the Polish Wikipedia) is hardcoded in the .rb and .js files. Sorry 'bout that.

Brief description of each file:

  • Wikipedia gadget

    • bioindex-editor.css and bioindex-editor.js – a gadget that allows editors to modify the Wikidata descriptions and Wikipedia defaultsorts straight from the index itself.
    • bioindex-editor-bootstrap.js – minimal loader for the gadget, to be added to common.js.
  • Primary scripts

    • build-index.rb – aggregate data from all sources and upload them to the index. Takes a few hours to run; generates temporary 'savepoints' which are used as the starting point of the next run (this allows it to be terminated at will without losing all the work).
    • parse-index.rb – parse old index of biographies and dump the data in JSON format to current directory.
    • upload-index.rb – upload the data generated by the above script to Wikidata.
    • sprzeczne.rb – compare birth and death year data aggregated from categories and from the old index of biographies, return a pretty table.
  • Mini-libraries

    • intro-extractor.rb – extracts brief descriptions and lifetime information from given Wikipedia pages.
    • roman.rb – wrapper for the roman gem to fix its broken handling of negative numbers (used to deal with centuries BC).
    • savepoint.rb – short wrapper for Marshal.load and .dump from/to file.
  • Miscellanea

    • .gitignore – contains a list of temporary files that running the Ruby scripts might generate.
    • LICENSE – MIT / CC BY-SA.
    • README.md – this file.
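
To illustrate the savepoint idea mentioned for build-index.rb and savepoint.rb, here is a minimal sketch of a wrapper around `Marshal.dump` and `Marshal.load` that caches the result of an expensive step in a file. It is not the actual contents of savepoint.rb: the `SavepointSketch` module, its method names and the file name in the usage note are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for savepoint.rb; names and behaviour are assumptions.
module SavepointSketch
  # Serialize +data+ to the file at +path+ and return +data+ unchanged.
  def self.dump(path, data)
    File.open(path, "wb") { |f| Marshal.dump(data, f) }
    data
  end

  # Read back whatever object was previously dumped to +path+.
  def self.load(path)
    File.open(path, "rb") { |f| Marshal.load(f) }
  end

  # If a savepoint exists at +path+, load it and skip the block.
  # Otherwise run the block, save its result, and return it.
  def self.fetch(path)
    return load(path) if File.exist?(path)
    dump(path, yield)
  end
end
```

With something like this, a long-running step could be written as `people = SavepointSketch.fetch("people.savepoint") { aggregate_people }`, where `aggregate_people` stands for any slow computation (also a made-up name). If the script is interrupted and restarted, the step is skipped whenever its savepoint file is already present. Since `Marshal.load` can instantiate arbitrary objects, it should only be used on files the script itself has written.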