Align Wikipedia documents based on interlanguage links
This code requires Python 3.x.
To use this tool, first install the requirements:
pip install -r requirements.txt
or, if the Python alias is not set up on your system:
pip3 install -r requirements.txt
To align Wikipedia documents, you need the SQL dump of interlanguage links and the document extracts for both languages:
python aligner.py --src-lang en --target-lang ka --sql-file data/enwiki-latest-langlinks.sql --src-corpus data/enwiki --target-corpus data/kawiki --out-dir data/out/
or, if the Python alias is not set up on your system:
python3 aligner.py --src-lang en --target-lang ka --sql-file data/enwiki-latest-langlinks.sql --src-corpus data/enwiki --target-corpus data/kawiki --out-dir data/out/
usage: aligner.py [-h] --src-lang SRC_LANG --target-lang TARGET_LANG
--sql-file SQL_FILE --src-corpus SRC_CORPUS --target-corpus
TARGET_CORPUS --out-dir OUT_DIR
Align Wikipedia documents based on interlanguage links .
optional arguments:
-h, --help show this help message and exit
--src-lang SRC_LANG source language. e.g., ar for Arabic, en for English,
or fr for French ...
--target-lang TARGET_LANG
target language. e.g., ar for Arabic, en for English,
or fr for French ...
--sql-file SQL_FILE source language links sql file. Obtained from
https://dumps.wikimedia.org/
--src-corpus SRC_CORPUS
source corpus directory.
--target-corpus TARGET_CORPUS
target corpus directory.
--out-dir OUT_DIR the output directory.
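For orientation, the file passed to --sql-file is a MediaWiki langlinks dump. The sketch below shows how such a dump could be read into a mapping from source page id to target-language title. It is an illustration only, not the code used by aligner.py: the function name parse_langlinks and the sample row are hypothetical, and it assumes the usual (ll_from, ll_lang, ll_title) row layout inside INSERT statements.

```python
import re

# One (ll_from, ll_lang, ll_title) tuple inside an INSERT statement.
# Assumption: string values are single-quoted with backslash escapes.
ROW_RE = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")


def parse_langlinks(lines, target_lang):
    """Return {source_page_id: target_title} for links into target_lang."""
    links = {}
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for page_id, lang, title in ROW_RE.findall(line):
            if lang == target_lang and title:
                # Undo SQL backslash escaping, e.g. \' becomes '
                links[int(page_id)] = re.sub(r"\\(.)", r"\1", title)
    return links


# Made-up sample row, for illustration only.
sample = [
    r"INSERT INTO `langlinks` VALUES (12,'ka','Sample_A'),(12,'fr','Exemple'),(25,'ka','O\'Sample');",
]
print(parse_langlinks(sample, "ka"))
```

With a real dump, the lines would come from the opened SQL file rather than from a list, and the resulting mapping could then be used to pair documents in the source and target corpora.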
To get information about a corpus (the most frequent words):
python corpus_info.py data/arz.wiki
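As a rough idea of what a most-frequent-words report involves, here is a minimal sketch. It is not the implementation of corpus_info.py: the names iter_texts and most_frequent_words are hypothetical, tokenization is plain whitespace splitting, and since it is not stated whether the corpus path is a file or a directory, the sketch accepts either.

```python
from collections import Counter
from pathlib import Path


def iter_texts(corpus_path):
    """Yield the text of a single file, or of every file under a directory."""
    root = Path(corpus_path)
    paths = [root] if root.is_file() else sorted(root.rglob("*"))
    for path in paths:
        if path.is_file():
            yield path.read_text(encoding="utf-8", errors="ignore")


def most_frequent_words(texts, top_n=10):
    """Count whitespace-separated, lowercased tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(top_n)


# Small in-memory example; with a corpus on disk, pass iter_texts(path) instead.
print(most_frequent_words(["a b a", "b a c"], top_n=2))
```

Whitespace splitting is a simplification; languages without space-delimited words, or corpora with heavy punctuation, would need a proper tokenizer.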
This project is used to extract comparable documents from Wikipedia.
Motaz Saad and Basem Alijla (2017). WikiDocsAligner: an off-the-shelf Wikipedia Documents Alignment Tool. In The Second Palestinian International Conference on Information and Communication Technology (PICICT 2017).
Contributions to improve the code are welcome. Please follow the steps below.
- Fork the project.
- Modify the code, test it, and make sure that it works.
- Make a pull request.
If you need help with any of these steps, please consult the GitHub Help documentation.