
📖 VocabTest

Vocabulary tests for an open-ended number of languages

TL;DR: Vocabulary tests are a useful tool to assess language proficiency. This repository lets you create vocabulary tests for an open-ended number of languages from just text. 📝 Try your vocabulary knowledge


Example items from WikiVocab

Setup

Clone the repository and install the requirements:

git clone https://github.com/polvanrijn/VocabTest
cd VocabTest
REPO_DIR=$(pwd)
python3.9 -m venv env  # Set up a virtual environment; I used Python 3.9.18 on macOS
source env/bin/activate
pip install -r requirements.txt
pip install -e .

Developer requirements

Optionally: Install dictionaries

Make sure you have either hunspell or myspell installed.

DIR_DICT=~/.config/enchant/hunspell  # if you use hunspell
DIR_DICT=~/.config/enchant/myspell   # if you use myspell
mkdir -p "$DIR_DICT"

Download the Libreoffice dictionaries:

cd $DIR_DICT
git clone https://github.com/LibreOffice/dictionaries
find dictionaries/ -type f -name "*.dic" -exec mv -i {} .  \;
find dictionaries/ -type f -name "*.aff" -exec mv -i {} .  \;
rm -Rf dictionaries/

Manually install missing dictionaries:

# Manually install dictionaries
function get_dictionary() {
  f="$(basename -- "$1")"
  wget "$1" --no-check-certificate
  unzip "$f" "*.dic" "*.aff"
  rm -f "$f"
}

# Urdu
get_dictionary https://versaweb.dl.sourceforge.net/project/aoo-extensions/2536/1/dict-ur.oxt

# Western Armenian
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/4841/0/hy_am_western-1.0.oxt

# Galician
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/corrector-18-07-para-galego.oxt

# Welsh
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/1583/1/geiriadur-cy.oxt
mv dictionaries/* .
rm -Rf dictionaries/

# Belarusian
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/dict-be-0-58.oxt

# Marathi
get_dictionary https://extensions.libreoffice.org/assets/downloads/73/1662621066/mr_IN-v8.oxt
mv dicts/* .
rm -Rf dicts/

Check that all dictionaries are installed:

python3 -c "import enchant
broker = enchant.Broker()
print(sorted(list(set([lang.split('_')[0] for lang in broker.list_languages()]))))"
Optionally: Install FastText
cd $REPO_DIR
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip3 install .
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
cd ..
Optionally: Install local UDPipe

Install tensorflow:

pip install tensorflow

Make sure GPU is available:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Install UDPipe:

cd $REPO_DIR
git clone https://github.com/ufal/udpipe
cd udpipe
git checkout udpipe-2
git clone https://github.com/ufal/wembedding_service
pip install .

Download the models

curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4804{/udpipe2-ud-2.10-220711.tar.gz}
tar -xvf udpipe2-ud-2.10-220711.tar.gz
rm udpipe2-ud-2.10-220711.tar.gz

I had to make one change to the code to make it work locally. Change line 375 in udpipe2_server.py to:

if not hasattr(socket, 'SO_REUSEPORT'):
    socket.SO_REUSEPORT = 15
Optionally: Word alignment
cd $REPO_DIR/vocabtest/bible/
mkdir dependencies
cd dependencies
git clone https://github.com/clab/fast_align
cd fast_align
mkdir build
cd build
cmake ..
make
Optionally: Uromanize
cd $REPO_DIR/vocabtest/bible/dependencies/
git clone https://github.com/isi-nlp/uroman

Tests

WikiVocab: Validated vocabulary test for 60 languages

  1. Afrikaans 🇿🇦
  2. Arabic (many countries)
  3. Belarusian 🇧🇾
  4. Bulgarian 🇧🇬
  5. Catalan 🇪🇸
  6. Czech 🇨🇿
  7. Welsh 🇬🇧
  8. Danish 🇩🇰
  9. German 🇩🇪🇨🇭🇦🇹
  10. Greek 🇬🇷
  11. English (many countries)
  12. Spanish (many countries)
  13. Estonian 🇪🇪
  14. Basque 🇪🇸
  15. Persian 🇮🇷🇦🇫🇹🇯
  16. Finnish 🇫🇮
  17. Faroese 🇩🇰
  18. French (many countries)
  19. Irish 🇮🇪
  20. Gaelic (Scottish) 🏴󠁧󠁢󠁳󠁣󠁴󠁿
  21. Galician 🇪🇸
  22. Gothic (dead)
  23. Hebrew 🇮🇱
  24. Hindi 🇮🇳
  25. Croatian 🇭🇷
  26. Hungarian 🇭🇺
  27. Armenian 🇦🇲
  28. Western Armenian
  29. Indonesian 🇮🇩
  30. Icelandic 🇮🇸
  31. Italian 🇮🇹
  32. Japanese 🇯🇵
  33. Korean 🇰🇷
  34. Latin (dead)
  35. Lithuanian 🇱🇹
  36. Latvian 🇱🇻
  37. Marathi 🇮🇳
  38. Maltese 🇲🇹
  39. Dutch 🇳🇱🇧🇪
  40. Norwegian Nynorsk 🇳🇴
  41. Norwegian Bokmål 🇳🇴
  42. Polish 🇵🇱
  43. Portuguese 🇵🇹
  44. Romanian 🇷🇴
  45. Russian 🇷🇺
  46. Sanskrit 🇮🇳
  47. Northern Sami 🇳🇴
  48. Slovak 🇸🇰
  49. Slovenian 🇸🇮
  50. Serbian 🇷🇸
  51. Swedish 🇸🇪
  52. Tamil 🇮🇳🇱🇰🇸🇬
  53. Telugu 🇮🇳
  54. Turkish 🇹🇷
  55. Uyghur 🇨🇳
  56. Ukrainian 🇺🇦
  57. Urdu 🇵🇰🇮🇳
  58. Vietnamese 🇻🇳
  59. Wolof 🇸🇳
  60. Chinese 🇨🇳

BibleVocab: Vocabulary test for more than 2000 languages

Create your own vocabulary test

Creating your own vocabulary test is easy. All you need is a large amount of text in the language; then implement two functions:

  • vocabtest.<your_dataset>.download: downloads the dataset and stores it in a subfolder called data
  • vocabtest.<your_dataset>.filter: filters and cleans the dataset and stores the following files in the database subfolder:
    • {language_id}-filtered.csv: a table with the word and count of all words that pass the filter
    • {language_id}-clean.txt: a text file with all cleaned words, used for training the compound-word splitter
    • {language_id}-all.txt: a text file with all words occurring in the corpus, used to reject pseudowords that already occur in the corpus
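To make the expected outputs concrete, here is a minimal sketch of what a filter step could look like. It assumes a plain-text corpus at data/<language_id>.txt; the file names match the list above, but the tokenizer and the filter rule (alphabetic words seen at least twice) are illustrative simplifications, not the repository's actual cleaning pipeline.

```python
import re
from collections import Counter
from pathlib import Path


def filter_dataset(language_id, corpus_dir="data", out_dir="database"):
    """Write the three files described above for one language (sketch)."""
    text = Path(corpus_dir, f"{language_id}.txt").read_text(encoding="utf-8")
    counts = Counter(re.findall(r"\w+", text.lower()))

    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    # {language_id}-all.txt: every word occurring in the corpus, later used
    # to reject pseudowords that already exist as real words.
    (out / f"{language_id}-all.txt").write_text("\n".join(sorted(counts)), encoding="utf-8")

    # Illustrative filter: keep purely alphabetic words seen at least twice.
    filtered = {w: c for w, c in counts.items() if w.isalpha() and c >= 2}

    # {language_id}-clean.txt: cleaned words, input for the compound-word splitter.
    (out / f"{language_id}-clean.txt").write_text("\n".join(sorted(filtered)), encoding="utf-8")

    # {language_id}-filtered.csv: word/count table of everything passing the filter.
    rows = sorted(filtered.items(), key=lambda wc: -wc[1])
    csv = "word,count\n" + "\n".join(f"{w},{c}" for w, c in rows)
    (out / f"{language_id}-filtered.csv").write_text(csv, encoding="utf-8")
```

Any implementation with the same inputs and outputs will plug into the commands below.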

You can now build your vocabulary test with:

vocabtest download <your_dataset> <language_id>
vocabtest filter <your_dataset> <language_id>
vocabtest create-pseudowords <your_dataset> <language_id>
vocabtest create-test <your_dataset> <language_id>
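The {language_id}-all.txt file produced by the filter step is what allows create-pseudowords to discard candidates that are actually real words. A minimal sketch of that rejection check (the helper names are hypothetical, not the repository's implementation):

```python
from pathlib import Path


def load_corpus_words(language_id, db_dir="database"):
    """Every word seen in the corpus, loaded from {language_id}-all.txt."""
    return set(Path(db_dir, f"{language_id}-all.txt").read_text(encoding="utf-8").split())


def is_valid_pseudoword(candidate, corpus_words):
    """Reject a candidate pseudoword if it already occurs in the corpus."""
    return candidate.lower() not in corpus_words
```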

Citation

@misc{vanrijn2023wikivocab,
      title={Around the world in 60 words: A generative vocabulary test for online research}, 
      author={Pol van Rijn and Yue Sun and Harin Lee and Raja Marjieh and Ilia Sucholutsky and Francesca Lanzarini and Elisabeth André and Nori Jacoby},
      year={2023},
      eprint={2302.01614},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2302.01614}, 
}