TL;DR: Vocabulary tests are a useful tool for assessing language proficiency. This repository lets you create your own vocabulary tests. Try your vocabulary knowledge!
Clone the repository and install the requirements:
git clone https://github.com/polvanrijn/VocabTest
cd VocabTest
REPO_DIR=$(pwd)
python3.9 -m venv env  # Set up a virtual environment; I used Python 3.9.18 on macOS
source env/bin/activate
pip install -r requirements.txt
pip install -e .
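If the install succeeded, the vocabtest command-line tool (used further below) should now be available inside the activated environment; assuming it exposes the standard --help flag, you can sanity-check it with:
vocabtest --help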
Optionally: Install dictionaries
Make sure you have either hunspell or myspell installed.
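To check which spell-checking backend enchant actually picked up, you can list the available providers (Broker.describe() is part of pyenchant):
python3 -c "import enchant; print([p.name for p in enchant.Broker().describe()])"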
DIR_DICT=~/.config/enchant/hunspell  # if you use hunspell
DIR_DICT=~/.config/enchant/myspell   # if you use myspell
mkdir -p $DIR_DICT
Download the LibreOffice dictionaries:
cd $DIR_DICT
git clone https://github.com/LibreOffice/dictionaries
find dictionaries/ -type f -name "*.dic" -exec mv -i {} . \;
find dictionaries/ -type f -name "*.aff" -exec mv -i {} . \;
rm -Rf dictionaries/
Manually install missing dictionaries:
function get_dictionary() {
  f="$(basename -- "$1")"
  wget "$1" --no-check-certificate
  unzip "$f" "*.dic" "*.aff"
  rm -f "$f"
}
# Urdu
get_dictionary https://versaweb.dl.sourceforge.net/project/aoo-extensions/2536/1/dict-ur.oxt
# Western Armenian
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/4841/0/hy_am_western-1.0.oxt
# Galician
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/corrector-18-07-para-galego.oxt
# Welsh
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/1583/1/geiriadur-cy.oxt
mv dictionaries/* .  # this extension unpacks into a dictionaries/ subfolder
rm -Rf dictionaries/
# Belarusian
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/dict-be-0-58.oxt
# Marathi
get_dictionary https://extensions.libreoffice.org/assets/downloads/73/1662621066/mr_IN-v8.oxt
mv dicts/* .  # this extension unpacks into a dicts/ subfolder
rm -Rf dicts/
Check that all dictionaries are installed:
python3 -c "import enchant
broker = enchant.Broker()
print(sorted(list(set([lang.split('_')[0] for lang in broker.list_languages()]))))"
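You can also probe a single dictionary by its tag (cy_GB is just an example; use a tag reported by the check above):
python3 -c "import enchant; print(enchant.dict_exists('cy_GB'))"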
Optionally: Install FastText
cd $REPO_DIR
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip3 install .
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
cd ..
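The lid.176.bin model performs language identification over 176 languages; presumably the pipeline uses it to drop sentences in the wrong language. A quick check that the model loads, run from $REPO_DIR with an arbitrary example sentence:
python3 -c "
import fasttext
model = fasttext.load_model('fastText/lid.176.bin')
print(model.predict('Dit is een Nederlandse zin.'))"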
Optionally: Install local UDPipe
Install tensorflow:
pip install tensorflow
Make sure a GPU is available:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Install UDPipe:
cd $REPO_DIR
git clone https://github.com/ufal/udpipe
cd udpipe
git checkout udpipe-2
git clone https://github.com/ufal/wembedding_service
pip install .
Download the models:
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4804{/udpipe2-ud-2.10-220711.tar.gz}
tar -xvf udpipe2-ud-2.10-220711.tar.gz
rm udpipe2-ud-2.10-220711.tar.gz
I had to make one change to run it locally: Python's socket module does not define SO_REUSEPORT on every platform, so change line 375 in udpipe2_server.py to set the constant manually (15 is its value on Linux):
if not hasattr(socket, 'SO_REUSEPORT'):
    socket.SO_REUSEPORT = 15
Optionally: Word alignment
cd $REPO_DIR/vocabtest/bible/
mkdir dependencies
cd dependencies
git clone https://github.com/clab/fast_align
cd fast_align
mkdir build
cd build
cmake ..
make
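fast_align reads a parallel corpus with one sentence pair per line, source and target separated by |||. A minimal usage sketch following the flags recommended in the fast_align README (corpus.src-tgt is a hypothetical input file):
cd $REPO_DIR/vocabtest/bible/dependencies/fast_align
./build/fast_align -i corpus.src-tgt -d -o -v > forward.align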
Optionally: Uromanize
cd $REPO_DIR/vocabtest/bible/dependencies/
git clone https://github.com/isi-nlp/uroman
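uroman romanizes text in arbitrary scripts from stdin to stdout via its Perl entry point (bin/uroman.pl at the time of writing); for example:
echo "Привет" | uroman/bin/uroman.pl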
Supported languages:
- Afrikaans 🇿🇦
- Arabic (many countries)
- Belarusian 🇧🇾
- Bulgarian 🇧🇬
- Catalan 🇪🇸
- Czech 🇨🇿
- Welsh 🇬🇧
- Danish 🇩🇰
- German 🇩🇪🇨🇭🇦🇹
- Greek 🇬🇷
- English (many countries)
- Spanish (many countries)
- Estonian 🇪🇪
- Basque 🇪🇸
- Persian 🇮🇷🇦🇫🇹🇯
- Finnish 🇫🇮
- Faroese 🇩🇰
- French (many countries)
- Irish 🇮🇪
- Gaelic (Scottish) 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Galician 🇪🇸
- Gothic (dead)
- Hebrew 🇮🇱
- Hindi 🇮🇳
- Croatian 🇭🇷
- Hungarian 🇭🇺
- Armenian 🇦🇲
- Western Armenian
- Indonesian 🇮🇩
- Icelandic 🇮🇸
- Italian 🇮🇹
- Japanese 🇯🇵
- Korean 🇰🇷
- Latin (dead)
- Lithuanian 🇱🇹
- Latvian 🇱🇻
- Marathi 🇮🇳
- Maltese 🇲🇹
- Dutch 🇳🇱🇧🇪
- Norwegian Nynorsk 🇳🇴
- Norwegian Bokmål 🇳🇴
- Polish 🇵🇱
- Portuguese 🇵🇹
- Romanian 🇷🇴
- Russian 🇷🇺
- Sanskrit 🇮🇳
- Northern Sami 🇳🇴
- Slovak 🇸🇰
- Slovenian 🇸🇮
- Serbian 🇷🇸
- Swedish 🇸🇪
- Tamil 🇮🇳🇱🇰🇸🇬
- Telugu 🇮🇳
- Turkish 🇹🇷
- Uyghur 🇨🇳
- Ukrainian 🇺🇦
- Urdu 🇵🇰🇮🇳
- Vietnamese 🇻🇳
- Wolof 🇸🇳
- Chinese 🇨🇳
Creating your own vocabulary test is easy. All you need is a large amount of text in a language; then implement two functions (a skeleton sketch follows below):
vocabtest.<your_dataset>.download: downloads the dataset and stores it in a subfolder called data
vocabtest.<your_dataset>.filter: filters and cleans the dataset and stores the following files in the database subfolder:
- {language_id}-filtered.csv: a table with the word and count of every word that passes the filter
- {language_id}-clean.txt: a text file with all cleaned words, used for training the compound-word splitter
- {language_id}-all.txt: a text file with all words occurring in the corpus, used to reject pseudowords that already appear in the corpus
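A minimal sketch of what such a module might look like. This is pure illustration: the corpus URL, tokenization, and filter rule are placeholders, and the exact signatures expected by the vocabtest CLI may differ.
# vocabtest/<your_dataset>/__init__.py -- hypothetical skeleton
import os
import re
from collections import Counter

import pandas as pd
import requests

def download(language_id):
    # Placeholder corpus URL; replace with your real data source
    url = f'https://example.com/corpora/{language_id}.txt'
    os.makedirs('data', exist_ok=True)
    with open(f'data/{language_id}.txt', 'w') as f:
        f.write(requests.get(url).text)

def filter(language_id):
    with open(f'data/{language_id}.txt') as f:
        counts = Counter(re.findall(r'\w+', f.read().lower()))
    os.makedirs('database', exist_ok=True)
    # All corpus words: used later to reject pseudowords that already exist
    with open(f'database/{language_id}-all.txt', 'w') as f:
        f.write('\n'.join(counts))
    # Toy filter: keep alphabetic words that occur at least 5 times
    kept = {w: c for w, c in counts.items() if w.isalpha() and c >= 5}
    with open(f'database/{language_id}-clean.txt', 'w') as f:
        f.write('\n'.join(kept))
    pd.DataFrame(sorted(kept.items()), columns=['word', 'count']).to_csv(
        f'database/{language_id}-filtered.csv', index=False)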
You can now create your vocabulary test with:
vocabtest download <your_dataset> <language_id>
vocabtest filter <your_dataset> <language_id>
vocabtest create-pseudowords <your_dataset> <language_id>
vocabtest create-test <your_dataset> <language_id>
If you use this repository, please cite the accompanying paper:

@misc{vanrijn2023wikivocab,
title={Around the world in 60 words: A generative vocabulary test for online research},
author={Pol van Rijn and Yue Sun and Harin Lee and Raja Marjieh and Ilia Sucholutsky and Francesca Lanzarini and Elisabeth André and Nori Jacoby},
year={2023},
eprint={2302.01614},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2302.01614},
}