/parse-langlinks

A Python Script to Parse Wikimedia langlinks SQL Dumps


Parsing Inter Language Links Out of Wikimedia Data Dumps

Author: Bill Thompson (biltho@mpi.nl)

Summary

A simple Python script that extracts inter-language links from Wikimedia SQL dumps. It writes out a CSV file with the columns (page_id, target_language, page_title_in_target_language).
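To illustrate the kind of extraction described above, here is a minimal sketch. It is not the code in main.py. It assumes that each data line of the dump is an INSERT INTO statement holding tuples of the form (page_id,'lang','title') with MySQL-style backslash escapes. The names parse_line, parse_dump and unescape are hypothetical, and the CSV header row is an assumption.

```python
import csv
import gzip
import re

# One (page_id,'lang','title') tuple inside an INSERT statement.
# Quoted fields may contain backslash escapes such as \' or \\.
ROW_RE = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")
# A backslash followed by any single character.
ESCAPE_RE = re.compile(r"\\(.)")


def unescape(value):
    # Simplified: keep the escaped character and drop the backslash.
    # This covers escaped quotes and backslashes, not control codes.
    return ESCAPE_RE.sub(r"\1", value)


def parse_line(line):
    # Yield (page_id, target_language, title) for every tuple in one line.
    # Lines that are not INSERT statements yield nothing.
    if not line.startswith("INSERT INTO"):
        return
    for page_id, lang, title in ROW_RE.findall(line):
        yield int(page_id), unescape(lang), unescape(title)


def parse_dump(sql_gz_path, csv_path):
    # Stream the gzipped dump line by line and write one CSV row per link.
    with gzip.open(sql_gz_path, "rt", encoding="utf-8", errors="replace") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        # Header row is an assumption, not confirmed by the example output.
        writer.writerow(["page_id", "target_language",
                         "page_title_in_target_language"])
        for line in src:
            writer.writerows(parse_line(line))
```

Under these assumptions, parse_dump("avwiki-latest-langlinks.sql.gz", "out.csv") would stream the example dump without loading it fully into memory. A regular expression is used here for brevity; a real SQL tokenizer would be more robust against unusual escapes.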

Usage:

python main.py -f avwiki-latest-langlinks.sql.gz

This script works on the SQL version of the langlinks table dump. This repository contains an example dump (avwiki-latest-langlinks.sql.gz) and an example of the parsed result (avwiki-latest-langlinks-parsed.csv). The latest dumps for the English Wikipedia, or for any other language edition, can be found at:

dumps.wikimedia.org/LANGUAGEwiki/latest/LANGUAGEwiki-latest-langlinks.sql.gz

where LANGUAGE is replaced by the language's ISO code (e.g. en, ab, es, pt, fr).
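The URL pattern above can be filled in programmatically. The following is a small sketch; the function name langlinks_dump_url is hypothetical, and the https scheme is an assumption since the pattern above does not state one.

```python
def langlinks_dump_url(language):
    # Substitute the language code into the documented URL pattern.
    # The "https://" prefix is an assumption; the pattern omits the scheme.
    wiki = f"{language}wiki"
    return f"https://dumps.wikimedia.org/{wiki}/latest/{wiki}-latest-langlinks.sql.gz"


if __name__ == "__main__":
    # Print the dump locations for a few of the example language codes.
    for code in ["en", "ab", "es", "pt", "fr"]:
        print(langlinks_dump_url(code))
```

The file downloaded from such a URL can then be passed to the script with the -f option shown under Usage.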