/navec

Compact high quality word embeddings for Russian language

Navec is a library of pretrained word embeddings for the Russian language. It shows competitive or better results than RusVectores, loads ~10 times faster (~1 sec), and takes ~10 times less space (~50 MB).

Navec = large Russian text datasets + vanilla GloVe + quantization

Downloads

How to read model filename:

navec_hudlit_v1_12B_500K_300d_100q.tar
                 |    |    |    |
                 |    |    |     ---- 100 dimensions after quantization
                 |    |     --------- original vectors have 300 dimensions
                 |     -------------- vocab size is 500 000 words + 2 for <unk>, <pad>
                  ------------------- dataset of 12 billion tokens was used
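
The naming scheme can also be read programmatically. A minimal sketch, assuming every published filename follows the pattern shown above; `parse_model_name` is a hypothetical helper and not part of the navec package:

```python
import re

# Hypothetical helper (not part of navec): splits a model filename into its parts.
PATTERN = re.compile(
    r'navec_(?P<name>[a-z]+)_v(?P<version>\d+)'
    r'_(?P<tokens>\d+B)_(?P<vocab>\d+K)'
    r'_(?P<dim>\d+)d_(?P<qdim>\d+)q\.tar'
)


def parse_model_name(filename):
    match = PATTERN.fullmatch(filename)
    if not match:
        raise ValueError('unexpected filename: %r' % filename)
    parts = match.groupdict()
    # Convert the purely numeric fields to int; keep "12B" and "500K" as strings
    for key in ('version', 'dim', 'qdim'):
        parts[key] = int(parts[key])
    return parts


print(parse_model_name('navec_hudlit_v1_12B_500K_300d_100q.tar'))
```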

Currently two models are published:

| Model | Size | Description | Sources |
|---|---|---|---|
| navec_hudlit_v1_12B_500K_300d_100q.tar | 50MB | Should be used by default. Shows the best results on intrinsic evaluations. The model was trained on a large corpus of literature (~150GB). | librusec |
| navec_news_v1_1B_250K_300d_100q.tar | 25MB | Try this model on news texts. It is two times smaller than `hudlit` but covers the same 98% of words in news articles. | lenta, ria, taiga_fontanka, buriy_news, buriy_webhose, ods_gazeta, ods_interfax |

Installation

Navec supports Python 3.7+ and PyPy 3.

$ pip install navec

Usage

First download the hudlit embeddings (see the table above):

wget https://storage.yandexcloud.net/natasha-navec/packs/navec_hudlit_v1_12B_500K_300d_100q.tar

Load the tar archive with Navec.load; it takes ~1 s and ~100 MB of RAM:

>>> from navec import Navec

>>> path = 'navec_hudlit_v1_12B_500K_300d_100q.tar'
>>> navec = Navec.load(path)

Then navec can be used as a dict object:

>>> navec['навек']
array([ 0.3955571 ,  0.11600914,  0.24605067, -0.35206917, -0.08932345,
        0.3382279 , -0.5457616 ,  0.07472657, -0.4753835 , -0.3330848 ,
        ...

>>> 'нааавееек' in navec
False

>>> navec.get('нааавееек')
None

To get the index of a word, use navec.vocab:

>>> navec.vocab['навек']
225823

>>> navec.vocab.get('наааавеeeк', navec.vocab.unk_id)
500000   # == navec.vocab['<unk>']

There are two special words in vocab, <unk> and <pad>:

>>> navec['<unk>']
array([ 3.69125791e-02,  9.32818875e-02,  2.01917738e-02, ...

>>> navec['<pad>']
array([0., 0., 0., 0., 0., 0., ...
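
These two special words are typically used when turning a sentence into a fixed-length list of ids: unknown words fall back to <unk>, and short sentences are filled up with <pad>. A minimal sketch of that pattern; `encode` is a hypothetical helper, and a plain dict stands in for navec.vocab so the example runs without a downloaded model. With a real model one would presumably pass navec.vocab, navec.vocab.unk_id and navec.vocab['<pad>']:

```python
def encode(tokens, vocab, unk_id, pad_id, size):
    """Map tokens to ids, replacing unknown words with unk_id and padding to `size`."""
    ids = [vocab.get(token, unk_id) for token in tokens[:size]]
    ids += [pad_id] * (size - len(ids))
    return ids


# Toy stand-in for navec.vocab (the ids are made up for illustration)
toy_vocab = {'навек': 0, 'слово': 1}
UNK_ID, PAD_ID = 2, 3

print(encode(['навек', 'нааавееек'], toy_vocab, UNK_ID, PAD_ID, size=4))
# [0, 2, 3, 3]
```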

To use Navec in a PyTorch model, there is a module in Slovnet:

>>> import torch
>>> from slovnet.model.emb import NavecEmbedding

>>> emb = NavecEmbedding(navec)
>>> input = torch.tensor([1, 2, 0])
>>> output = emb(input)

>>> output.shape
torch.Size([3, 300])

>>> output
tensor([[ 4.2000e-01,  3.6666e-01,  1.7728e-01, -3.8719e-01, -1.0762e-01,
          1.6954e-01, -4.6063e-01,  5.4519e-01, -2.1212e-01,  2.0965e-01,
          1.9658e-01,  2.7807e-01, -2.3802e-01,  3.5155e-01,  1.4491e-02,
		  ...

Documentation

Materials are in Russian:

Evaluation

Let's compare Navec to the top 5 RusVectores models (based on the simlex and hj eval datasets). In each column the top 3 results are highlighted.

  • init: time it takes to load the model file into RAM. The tayga_upos_skipgram_300_2_2019 word2vec binary file takes 5 seconds to load with gensim.KeyedVectors.load_word2vec_format. The large (~2.7 GB) tayga_none_fasttextcbow_300_10_2019 fastText file takes 8 seconds. Navec hudlit, with a vocab 2 times larger than the previous two, takes 1 second.
  • get: time it takes to query the embedding vector for a single word. Word2vec models win here: to fetch a vector they basically do dict.get. FastText and Navec do extra computation for every query: FastText extracts and sums word ngrams, Navec unpacks the vector from the quantization table. In practice the time spent querying the embeddings table is small compared to all other computation in a network.
  • disk: model file size. Small models are convenient for deployment and distribution. Notice that the hudlit model is 4-20 times smaller while its vocab is 2 times bigger.
  • ram: space the model takes in RAM after loading. A small memory footprint makes it possible to fit more computation on a single server.
  • vocab: number of words in the vocab, i.e. number of embedding vectors. Since the Navec vectors table takes less space, we can have a larger vocab. With a 500K vocab the hudlit model has a ~2% OOV rate on average.
| model | type | init, s | get, µs | disk, mb | ram, mb | vocab |
|---|---|---|---|---|---|---|
| hudlit_12B_500K_300d_100q | navec | 1.1 | 21.6 | 50.6 | 95.3 | 500K |
| news_1B_250K_300d_100q | navec | 0.8 | 20.7 | 25.4 | 47.7 | 250K |
| ruscorpora_upos_cbow_300_20_2019 | w2v | 3.3 | 1.4 | 220.6 | 236.1 | 189K |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | w2v | 5.0 | 1.5 | 290.0 | 309.4 | 248K |
| tayga_upos_skipgram_300_2_2019 | w2v | 5.2 | 1.4 | 290.7 | 310.9 | 249K |
| tayga_none_fasttextcbow_300_10_2019 | fasttext | 8.0 | 13.4 | 2741.9 | 2746.9 | 192K |
| araneum_none_fasttextcbow_300_5_2018 | fasttext | 16.4 | 10.6 | 2752.1 | 2754.7 | 195K |
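
The disk numbers and the extra work at query time both follow from quantization. A rough sketch, assuming product quantization where each 300-dimensional vector is split into 100 chunks of 3 dimensions and each chunk is stored as a one-byte index into a table of 256 centroids. The centroid count is an assumption that happens to match the ~50 MB file size; `pq_size_bytes` and `pq_unpack` are hypothetical names, not the navec API:

```python
def pq_size_bytes(vocab, dim, qdim, centroids=256, float_bytes=4):
    """Approximate size of a product-quantized table: codes plus codebooks."""
    codes = vocab * qdim  # one byte per code when centroids <= 256
    codebooks = qdim * centroids * (dim // qdim) * float_bytes
    return codes + codebooks


def pq_unpack(codes, codebooks):
    """Rebuild a full vector by concatenating the centroid chosen for each chunk."""
    vector = []
    for chunk, code in enumerate(codes):
        vector.extend(codebooks[chunk][code])
    return vector


# 500 002 words, 300 dims, 100 codes per word
print(pq_size_bytes(500_002, 300, 100) / 1e6)  # ~50.3 MB
print(500_002 * 300 * 4 / 1e6)                 # ~600 MB for plain float32 vectors

# Toy example: 2 chunks of 2 dims, 2 centroids per chunk
codebooks = [
    [[0.0, 0.1], [0.2, 0.3]],
    [[1.0, 1.1], [1.2, 1.3]],
]
print(pq_unpack([1, 0], codebooks))  # [0.2, 0.3, 1.0, 1.1]
```

The lookup in `pq_unpack` is the per-query cost reflected in the get column: one table read per chunk instead of a single dict.get.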

Now let's look at intrinsic evaluation scores. The Navec hudlit model does not show the best results on all datasets, but it is usually in the top 3. We'll use 6 datasets:

  • simlex965, hj: two small datasets (965 and 398 tests respectively) used by RusVectores, see their paper for more info. The metric is Spearman correlation; the other datasets use average precision.
  • rt, ae, ae2: large datasets (114066, 22919, 86772 tests) from the RUSSE workshop, see the project description for more.
  • lrwc: a relatively new dataset by Yandex.Toloka, see their page.
| model | type | simlex | hj | rt | ae | ae2 | lrwc |
|---|---|---|---|---|---|---|---|
| hudlit_12B_500K_300d_100q | navec | 0.310 | 0.707 | 0.842 | 0.931 | 0.923 | 0.604 |
| news_1B_250K_300d_100q | navec | 0.230 | 0.590 | 0.784 | 0.866 | 0.861 | 0.589 |
| ruscorpora_upos_cbow_300_20_2019 | w2v | 0.359 | 0.685 | 0.852 | 0.758 | 0.896 | 0.602 |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | w2v | 0.321 | 0.723 | 0.817 | 0.801 | 0.860 | 0.629 |
| tayga_upos_skipgram_300_2_2019 | w2v | 0.429 | 0.749 | 0.871 | 0.771 | 0.899 | 0.639 |
| tayga_none_fasttextcbow_300_10_2019 | fasttext | 0.369 | 0.639 | 0.793 | 0.682 | 0.813 | 0.536 |
| araneum_none_fasttextcbow_300_5_2018 | fasttext | 0.349 | 0.671 | 0.801 | 0.706 | 0.793 | 0.579 |
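
For the simlex and hj columns, the score is the Spearman correlation between human similarity judgements and the similarities the model assigns to the same word pairs. A self-contained sketch of that metric; the numbers are made up for illustration, and the actual evaluation code may differ:

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        stop = start
        while stop + 1 < len(order) and values[order[stop + 1]] == values[order[start]]:
            stop += 1
        average = (start + stop) / 2 + 1
        for index in order[start:stop + 1]:
            result[index] = average
        start = stop + 1
    return result


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up numbers: human similarity judgements vs. model cosine similarities
human = [9.0, 7.5, 3.0, 1.0]
model = [0.81, 0.64, 0.70, 0.12]
print(spearman(human, model))
```

Only the ordering of the pairs matters here, so a model is not penalised for using a different similarity scale than the annotators.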

Support

Development

Dev env

python -m venv ~/.venvs/natasha-navec
source ~/.venvs/natasha-navec/bin/activate

pip install -r requirements/dev.txt
pip install -e .

Test + lint

make test

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags

Notice! All commands below use code from navec/train. It is not under CI, it works only with Python 3, and it is expected that the user is familiar with the source code. We use Yandex Cloud Compute and Object Storage.

Create remote worker

To compute cooc (large HDD, 1 TB for librusec).

yc compute instance create \
    --name worker \
    --zone ru-central1-a \
    --network-interface subnet-name=default,nat-ip-version=ipv4 \
    --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804,type=network-hdd,size=1000 \
    --memory 8 \
    --cores 2 \
    --core-fraction 100 \
    --ssh-key ~/.ssh/id_rsa.pub \
    --folder-name default \
    --preemptible  # in case computation takes <24h

To fit embeddings (multiple cores). HDD should be > cooc.bin * 3 (for shuffle + tmp).

yc compute instance create \
    --name worker \
    --zone ru-central1-a \
    --network-interface subnet-name=default,nat-ip-version=ipv4 \
    --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804,type=network-hdd,size=700 \
    --memory 16 \
    --cores 16 \
    --core-fraction 100 \
    --ssh-key ~/.ssh/id_rsa.pub  \
    --folder-name default \
    --preemptible

Setup machine

yc compute instance list --folder-name default
ssh yc-user@123.123.123.123

sudo locale-gen en_US.UTF-8
sudo timedatectl set-timezone Europe/Moscow
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y language-pack-ru python3-pip screen unzip git pv cmake

wget https://nlp.stanford.edu/software/GloVe-1.2.zip
unzip GloVe-1.2.zip
rm GloVe-1.2.zip
mv GloVe-1.2 glove
cd glove
make
cd ..

export GLOVE_DIR=~/glove/build

git clone https://github.com/natasha/navec.git
sudo -H pip3 install -e navec
sudo -H pip3 install -r navec/requirements/train.txt

screen
ctrl a d

Remove instance

yc compute instance list --folder-name default
yc compute instance delete xxxxxxxxx

Env, used by navec-train s3|vocab|cooc|emb

export S3_KEY=_XxXXXxxx_XXXxxxxXxxx
export S3_SECRET=XXxxx_XXXXXXxxxxxxXXXXxxXXx-XxxXXxxxX
export S3_BUCKET=XXXXXXX
export GLOVE_DIR=~/path/to/glove/build

Share text data (see corus)

navec-train s3 upload librusec_fb2.plain.gz sources/librusec.gz
navec-train s3 upload taiga/proza_ru.zip sources/taiga_proza.zip

navec-train s3 upload ruwiki-latest-pages-articles.xml.bz2 sources/wiki.xml.bz2

navec-train s3 upload lenta-ru-news.csv.gz sources/lenta.csv.gz
navec-train s3 upload ria.json.gz sources/ria.json.gz
navec-train s3 upload taiga/Fontanka.tar.gz sources/taiga_fontanka.tar.gz
navec-train s3 upload buriy/news-articles-2014.tar.bz2 sources/buriy_news1.tar.bz2
navec-train s3 upload buriy/news-articles-2015-part1.tar.bz2 sources/buriy_news2.tar.bz2
navec-train s3 upload buriy/news-articles-2015-part2.tar.bz2 sources/buriy_news3.tar.bz2
navec-train s3 upload buriy/webhose-2016.tar.bz2 sources/buriy_webhose.tar.bz2
navec-train s3 upload ods/gazeta_v1.csv.zip sources/ods_gazeta.csv.zip
navec-train s3 upload ods/interfax_v1.csv.zip sources/ods_interfax.csv.zip

navec-train s3 download sources/librusec.gz
navec-train s3 download sources/taiga_proza.zip

navec-train s3 download sources/wiki.xml.bz2

navec-train s3 download sources/lenta.csv.gz
navec-train s3 download sources/ria.json.gz
navec-train s3 download sources/taiga_fontanka.tar.gz
navec-train s3 download sources/buriy_news1.tar.bz2
navec-train s3 download sources/buriy_news2.tar.bz2
navec-train s3 download sources/buriy_news3.tar.bz2
navec-train s3 download sources/buriy_webhose.tar.bz2
navec-train s3 download sources/ods_gazeta.csv.zip
navec-train s3 download sources/ods_interfax.csv.zip

Text to tokens

navec-train corpus librusec librusec.gz | pv | navec-train tokenize > tokens.txt  # ~12B words
navec-train corpus taiga_proza taiga_proza.zip | pv | navec-train tokenize > tokens.txt  # ~3B

navec-train corpus wiki wiki.xml.bz2 | pv | navec-train tokenize > tokens.txt  # ~0.5B

navec-train corpus lenta lenta.csv.gz | pv | navec-train tokenize >> tokens.txt
navec-train corpus ria ria.json.gz | pv | navec-train tokenize >> tokens.txt
navec-train corpus taiga_fontanka taiga_fontanka.tar.gz | pv | navec-train tokenize >> tokens.txt
navec-train corpus buriy_news buriy_news1.tar.bz2 | pv | navec-train tokenize >> tokens.txt
navec-train corpus buriy_news buriy_news2.tar.bz2 | pv | navec-train tokenize >> tokens.txt
navec-train corpus buriy_news buriy_news3.tar.bz2 | pv | navec-train tokenize >> tokens.txt
navec-train corpus buriy_webhose buriy_webhose.tar.bz2 | pv | navec-train tokenize >> tokens.txt
navec-train corpus ods_gazeta ods_gazeta.csv.zip | pv | navec-train tokenize >> tokens.txt
navec-train corpus ods_interfax ods_interfax.csv.zip | pv | navec-train tokenize >> tokens.txt  # ~1B

pv tokens.txt | gzip > tokens.txt.gz
navec-train s3 upload tokens.txt.gz librusec_tokens.txt.gz

navec-train s3 upload tokens.txt taiga_proza_tokens.txt
navec-train s3 upload tokens.txt news_tokens.txt
navec-train s3 upload tokens.txt wiki_tokens.txt

Tokens to vocab

pv tokens.txt \
	| navec-train vocab count \
	> full_vocab.txt

pv full_vocab.txt \
	| navec-train vocab quantile

# librusec
# ...
# 0.970      325 882
# 0.980      511 542
# 0.990    1 122 624
# 1.000   22 129 654

# taiga_proza
# ...
# 0.960      229 906
# 0.970      321 810
# 0.980      517 647
# 0.990    1 224 277
# 1.000   14 302 409

# wiki
# ...
# 0.950     380 134
# 0.960     519 817
# 0.970     757 561
# 0.980   1 223 201
# 0.990   2 422 265
# 1.000   6 664 630

# news
# ...
# 0.970    163 833
# 0.980    243 903
# 0.990    462 361
# 1.000  3 744 070

# threshold at ~0.98
# librusec 500000
# taiga_proza 500000
# wiki 750000
# news 250000

cat full_vocab.txt \
	| head -500000 \
	| LC_ALL=C sort \
	> vocab.txt

navec-train s3 upload full_vocab.txt librusec_full_vocab.txt
navec-train s3 upload vocab.txt librusec_vocab.txt

navec-train s3 upload full_vocab.txt taiga_proza_full_vocab.txt
navec-train s3 upload vocab.txt taiga_proza_vocab.txt

navec-train s3 upload full_vocab.txt wiki_full_vocab.txt
navec-train s3 upload vocab.txt wiki_vocab.txt

navec-train s3 upload full_vocab.txt news_full_vocab.txt
navec-train s3 upload vocab.txt news_vocab.txt
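
The quantile tables above can be read as token coverage: under the assumption that a row such as `0.980  511 542` means the 511 542 most frequent words account for 98% of all token occurrences, picking the vocab size is a cumulative-sum search. A sketch of that reading; `vocab_size_for_coverage` is a hypothetical helper and not the `navec-train vocab quantile` implementation:

```python
def vocab_size_for_coverage(counts, share):
    """Smallest number of most frequent words whose counts cover `share` of all tokens."""
    counts = sorted(counts, reverse=True)
    total = sum(counts)
    covered = 0
    for size, count in enumerate(counts, start=1):
        covered += count
        if covered / total >= share:
            return size
    return len(counts)


# Toy word counts; real input would be the per-word counts of the corpus
counts = [50, 30, 10, 5, 5]
print(vocab_size_for_coverage(counts, 0.8))  # 2
print(vocab_size_for_coverage(counts, 0.9))  # 3
```

The long tail is what makes the ~0.98 threshold attractive: in the librusec table, going from 0.98 to 0.99 coverage more than doubles the vocab.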

Compute co-occurrence pairs

# Default limit on max number of open files is 1024, merge fails if
# number of chunks is large

ulimit -n  # 1024
sudo nano /etc/security/limits.conf

* soft     nofile         65535
* hard     nofile         65535

# relogin
ulimit -n  # 65535

pv tokens.txt \
	| navec-train cooc count vocab.txt --memory 7 --window 10 \
	> cooc.bin

# Monitor
ls /tmp/cooc_*
tail -c 16 cooc.bin | navec-train cooc parse

navec-train s3 upload cooc.bin librusec_cooc.bin
navec-train s3 upload cooc.bin taiga_proza_cooc.bin
navec-train s3 upload cooc.bin wiki_cooc.bin
navec-train s3 upload cooc.bin news_cooc.bin
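
To make the `--window 10` option concrete, here is a simplified in-memory sketch of co-occurrence counting. It assumes the symmetric window and 1/distance weighting that GloVe uses by default; whether `navec-train cooc count` applies exactly the same weighting is not stated here, and the real command works on disk in chunks rather than in a dict. `count_cooc` is a hypothetical name:

```python
from collections import defaultdict


def count_cooc(tokens, vocab, window=10):
    """Accumulate symmetric co-occurrence weights for in-vocab word pairs."""
    counts = defaultdict(float)
    for position, word in enumerate(tokens):
        if word not in vocab:
            continue
        # Look back at most `window` tokens
        start = max(0, position - window)
        for other_position in range(start, position):
            other = tokens[other_position]
            if other not in vocab:
                continue
            weight = 1.0 / (position - other_position)  # closer words count more
            counts[word, other] += weight
            counts[other, word] += weight
    return counts


tokens = ['a', 'b', 'c']
counts = count_cooc(tokens, vocab={'a', 'b', 'c'}, window=2)
print(counts['a', 'b'], counts['a', 'c'])  # 1.0 0.5
```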

Merge (did not give much boost compared to plain librusec, so all_vocab.txt, all_cooc.bin not used below)

for i in librusec taiga_proza wiki news; do
	navec-train s3 download ${i}_vocab.txt;
	navec-train s3 download ${i}_cooc.bin;
done

navec-train merge vocabs \
	librusec_vocab.txt \
	taiga_proza_vocab.txt \
	wiki_vocab.txt \
	news_vocab.txt \
	| pv > vocab.txt

navec-train merge coocs vocab.txt \
	librusec_cooc.bin:librusec_vocab.txt \
	taiga_proza_cooc.bin:taiga_proza_vocab.txt \
	wiki_cooc.bin:wiki_vocab.txt \
	news_cooc.bin:news_vocab.txt \
	| pv > cooc.bin

navec-train s3 upload vocab.txt all_vocab.txt
navec-train s3 upload cooc.bin all_cooc.bin

Compute embeddings

navec-train s3 download librusec_vocab.txt vocab.txt
navec-train s3 download librusec_cooc.bin cooc.bin

navec-train s3 download wiki_vocab.txt vocab.txt
navec-train s3 download wiki_cooc.bin cooc.bin

navec-train s3 download news_vocab.txt vocab.txt
navec-train s3 download news_cooc.bin cooc.bin

pv cooc.bin \
	| navec-train cooc shuffle --memory 15 \
	> shuf_cooc.bin

# Search dim with best score
for i in 100 200 300 400 500 600;
	do navec-train emb shuf_cooc.bin vocab.txt emb_${i}d.txt --dim $i --threads 10 --iterations 2;
done

# 300 has ok score. 400, 500 are a bit better, but too heavy
navec-train emb shuf_cooc.bin vocab.txt emb.txt --dim 300 --threads 16 --iterations 15

navec-train s3 upload emb.txt librusec_emb.txt
navec-train s3 upload emb.txt wiki_emb.txt
navec-train s3 upload emb.txt news_emb.txt

Quantize

navec-train s3 download librusec_emb.txt emb.txt
navec-train s3 download wiki_emb.txt emb.txt
navec-train s3 download news_emb.txt emb.txt

# Search for best compression that has still ok score
for i in 150 100 75 60 50;
	do pv emb.txt | navec-train pq fit $i --sample 100000 --iterations 15 > pq_${i}q.bin;
done

# 100 is <1% worse on eval but much lighter
pv emb.txt | navec-train pq fit 100 --sample 100000 --iterations 20 > pq.bin

navec-train pq pad < pq.bin > t; mv t pq.bin

navec-train s3 upload pq.bin librusec_pq.bin
navec-train s3 upload pq.bin wiki_pq.bin
navec-train s3 upload pq.bin news_pq.bin

Pack

navec-train s3 download librusec_pq.bin pq.bin
navec-train s3 download librusec_vocab.txt vocab.txt

navec-train s3 download news_pq.bin pq.bin
navec-train s3 download news_vocab.txt vocab.txt

navec-train vocab pack < vocab.txt > vocab.bin

navec-train pack vocab.bin pq.bin hudlit_v1_12B_500K_300d_100q
navec-train s3 upload navec_hudlit_v1_12B_500K_300d_100q.tar packs/navec_hudlit_v1_12B_500K_300d_100q.tar

navec-train pack vocab.bin pq.bin news_v1_1B_250K_300d_100q
navec-train s3 upload navec_news_v1_1B_250K_300d_100q.tar packs/navec_news_v1_1B_250K_300d_100q.tar

Publish

navec-train s3 download packs/navec_hudlit_v1_12B_500K_300d_100q.tar
navec-train s3 download packs/navec_news_v1_1B_250K_300d_100q.tar