templatesAlignment

Aligning named templates on Wikipedia


One challenge for the Content Translation tool is translating MediaWiki templates. While the equivalent template in the target language can be found through Wikidata, finding the mapping between template parameters is not trivial: for example, the English {{Infobox person}} uses parameters such as name and birth_date, while the Spanish {{Ficha de persona}} uses nombre and fecha de nacimiento.

Here we aim to solve this problem using multilingual fastText vectors.

This repository contains the notebooks needed to compute those parameter alignments.

Installation

Set up the environment

  • Clone this repository
git clone https://github.com/digitalTranshumant/templatesAlignment.git
  • Install the Python 3 requirements
 pip install -r requirements.txt
  • Download the Wikipedia dumps for all the languages that you want to align (see the sketch after this list).
  • Write the language codes that you want to use in the config.json file, for example: {"langs": ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "he", "it", "ta", "id", "fa", "ca"]}
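
For steps 3 and 4, here is a minimal sketch of fetching the latest articles dump for each configured language. It assumes the standard Wikimedia dumps layout; the notebooks may expect a different local path for the files:

import json
import urllib.request

with open('config.json') as f:
    langs = json.load(f)['langs']

for lang in langs:
    # Standard Wikimedia dumps URL for the latest articles dump of each edition.
    name = '{0}wiki-latest-pages-articles.xml.bz2'.format(lang)
    url = 'https://dumps.wikimedia.org/{0}wiki/latest/{1}'.format(lang, name)
    print('Downloading', url)
    urllib.request.urlretrieve(url, name)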

Run the notebooks

  • 00ExtractNamedTempates.ipynb: Run this notebook to extract all templates from the dumps. This process can be slow; for example, on a machine with 8 cores and 64 GB of RAM, it takes around 24 hours (in total) for the languages listed above.
  • 01Download.ipynb: This notebook downloads the pre-trained fastText models. Each model, once unzipped, uses around 15 GB of disk space.
  • 02alignmentsSpark.ipynb: This notebook creates the alignment matrices for each language pair. It is written to run on a Spark cluster, but you can modify it to run in plain Python by replacing the Parquet data with a SQL dump from this public table (a rough sketch of the alignment idea follows this list).
  • 03ProduceAlignments.ipynb: This notebook creates the alignments between template parameters. It can take around 10 hours to run.
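
To give a rough idea of what such an alignment involves, here is a simplified sketch, not the code in the notebooks: the seed dictionary, the file names, and the use of scipy's orthogonal Procrustes solver are all assumptions. The idea is to learn a rotation between two fastText spaces and then compare parameter names across languages:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def load_vectors(path, limit=100000):
    # Read a fastText .vec file: one header line, then "word v1 v2 ... vN" per line.
    vectors = {}
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the header (vocabulary size and dimension)
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

es = load_vectors('wiki.es.vec')  # assumed file names for the downloaded models
en = load_vectors('wiki.en.vec')

# A bilingual seed dictionary is needed to learn the rotation; three pairs are
# shown here for brevity, a real run needs thousands.
seed = [('nombre', 'name'), ('fecha', 'date'), ('imagen', 'image')]
pairs = [(s, t) for s, t in seed if s in es and t in en]
X = np.stack([es[s] for s, t in pairs])
Y = np.stack([en[t] for s, t in pairs])

# Orthogonal Procrustes: the rotation W that minimizes ||X @ W - Y||.
W, _ = orthogonal_procrustes(X, Y)

def distance(param_es, param_en):
    # Cosine distance between a mapped Spanish parameter name and an English one.
    u, v = es[param_es] @ W, en[param_en]
    return 1 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# An alignment between two parameter names is accepted when the distance is
# small enough (cf. the .45 threshold in the TODOs below).
print(distance('nombre', 'name'))

In practice the seed dictionary needs thousands of pairs, and multi-word parameter names need their own handling (for example, averaging the word vectors).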

TODOs

Remove hardcoded parameters

The 03ProduceAlignments notebook has a few hardcoded parameters that should be moved to the config file (see the sketch after this list) and tuned more carefully:

  • Currently we consider templates that appear more than 50 times in each edition. That number is arbitrary: increasing it leaves fewer templates to consider, while setting it very low would include noise (typos) and make the process too slow.
templates1Popular = {name: data for name, data in templates1.items() if data['Tcount'] > 50}  # keep only templates used more than 50 times
  • The number of parameters considered is also hardcoded: currently we keep parameters that appear in more than 30% of the template's usages. The tradeoff when moving this number is similar to the previous case:
template1['params'] = [name for name, count in template1['Params'].items() if count / template1['Tcount'] > .3]  # keep parameters that appear in more than 30% of the template's usages
  • The minimum similarity accepted to validate an alignment is also arbitrary, currently a distance cutoff of .45:
 if dist < .45:  # accept the alignment only when the distance is below the threshold
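
A minimal sketch of what moving these three values into config.json could look like; the key names below are hypothetical:

# Hypothetical additions to config.json:
# {"langs": ["es", "en", ...],
#  "minTemplateCount": 50, "minParamShare": 0.3, "maxDistance": 0.45}
import json

with open('config.json') as f:
    config = json.load(f)

MIN_TEMPLATE_COUNT = config['minTemplateCount']  # replaces the hardcoded 50
MIN_PARAM_SHARE = config['minParamShare']        # replaces the hardcoded .3
MAX_DISTANCE = config['maxDistance']             # replaces the hardcoded .45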

To learn the correct values for all these parameters, we need some ground truth, which is currently missing.