There are existing embeddings trained on collected corpora of newswire, articles, fiction, and juridical texts by a team of Ukrainian researchers. We assumed that training similar embeddings on Ukrainian Wikipedia could give us comparable (or better) results, since Wikipedia draws on a larger number of sources and fields of life. Besides that, we think that the possibility to process all articles at once by utilizing Big Data technologies can make the data preparation process faster than a manual one (combining texts from different sources).
To prepare the Wiki dump for training the Word2Vec model, we performed a few data engineering tasks (a minimal code sketch follows the list):
- Parsed the `.xml` file with the Wiki dump and split it into smaller, manageable chunks of 10 thousand articles each.
- Read all chunks and concatenated them into a single data frame.
- Removed all non-Ukrainian letters, symbols, tags, and special characters using regular expressions.
- Tokenized the texts.
- Removed stop words.
- Determined service Wikipedia words (used in the markup) using word counts and some manual work.
- Removed the service words.
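A minimal PySpark sketch of the cleaning and tokenization steps above. The column names, the CSV path, the regex character class, and the inline stop-word list are illustrative assumptions, not the project's exact code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import Tokenizer, StopWordsRemover

spark = SparkSession.builder.appName("uk-wiki-preprocess").getOrCreate()

# A single dataframe concatenated from the 10-thousand-article chunks.
df = spark.read.csv("hdfs:///wiki/chunks/*.csv", header=True)

# Keep only Ukrainian letters, apostrophes, and whitespace.
clean = df.withColumn(
    "text",
    F.regexp_replace(F.col("text"), r"[^а-щьюяєіїґА-ЩЬЮЯЄІЇҐ'\s]", " "),
)

# Tokenize, then drop stop words and Wikipedia service words.
tokenized = Tokenizer(inputCol="text", outputCol="raw_tokens").transform(clean)
stop_words = ["і", "в", "на"]  # placeholder; the real lists are loaded from files
remover = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens", stopWords=stop_words)
result = remover.transform(tokenized)
```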
The cluster for word2vec training (PySpark and Jupyter Notebook) was launched on GCP with Dataproc. The initial `.xml` dump preprocessing was performed on a Datalab instance (since we were not able to run it locally because of memory consumption).
- Guides
- Cluster configuration
Node | Count | Memory | vCPU |
---|---|---|---|
Master | 1 | 52 GB | 2 |
Worker | 3 | 52 GB (each) | 2 (each) |
Total | 4 | 208 GB | 8 |
- Actions
  - Copy the `csv` files with articles from `gs` to the cluster `hdfs`.
  - Download stop words and service words.
  - Download `n` articles (they were split into 90 `csv` files containing 10000 articles each).
  - Preprocess the articles.
  - Create a word2vec model from the articles.
  - Store the model on `gs` (Cloud Storage).
Word2Vec trains a model of Map(String, Vector): it transforms a word into a vector of numbers for further natural language processing or machine learning.
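A minimal, self-contained sketch of what this looks like with Spark ML's `Word2Vec`. The toy dataframe, column names, and bucket path are illustrative assumptions; only the vector size of 100 matches our setup:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("uk-wiki-word2vec").getOrCreate()

# Toy stand-in for the preprocessed articles: one array of tokens per row.
tokenized = spark.createDataFrame(
    [(["київ", "столиця", "україни"],), (["словник", "мова", "текст"],)],
    ["tokens"],
)

# vectorSize=100 matches the embedding size we settled on; minCount=1 only
# so the toy vocabulary is not filtered out.
word2vec = Word2Vec(vectorSize=100, minCount=1, inputCol="tokens", outputCol="features")
model = word2vec.fit(tokenized)

# The learned Map(String, Vector): one row per vocabulary word.
model.getVectors().show(5, truncate=False)

# On the real corpus the model would then be persisted, e.g.:
# model.save("gs://<bucket>/models/word2vec_110k")
```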
We created a `Pipeline` class to handle word2vec training (a rough skeleton follows the method list). This class includes the following methods:

- `init` - initializes the dataframe, stop words, and service words.
- `preprocess` - preprocesses the dataframe and computes tokens.
- `fit` - takes the number of samples to train the model on, i.e. we create a Spark dataframe with a part of the articles.
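The skeleton below is reconstructed from the method descriptions above; the signatures, column names, and internals are assumptions rather than the original implementation:

```python
from pyspark.sql import DataFrame
from pyspark.ml.feature import Tokenizer, StopWordsRemover, Word2Vec


class Pipeline:
    """Hypothetical skeleton of the training pipeline described above."""

    def __init__(self, df: DataFrame, stop_words: list, service_words: list):
        # Keep the raw articles and the word lists used for filtering.
        self.df = df
        self.remove_words = list(stop_words) + list(service_words)
        self.tokens_df = None

    def preprocess(self) -> DataFrame:
        # Tokenize the text column and drop stop/service words.
        tokenized = Tokenizer(inputCol="text", outputCol="raw_tokens").transform(self.df)
        remover = StopWordsRemover(
            inputCol="raw_tokens", outputCol="tokens", stopWords=self.remove_words
        )
        self.tokens_df = remover.transform(tokenized)
        return self.tokens_df

    def fit(self, n_samples: int):
        # Train word2vec on a subset of n_samples articles.
        subset = self.tokens_df.limit(n_samples)
        w2v = Word2Vec(vectorSize=100, inputCol="tokens", outputCol="features")
        return w2v.fit(subset)
```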
While training, we faced many Java heap space errors and RPC memory limit errors. We solved this by adding more RAM to our cluster setup.
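For reference, these are the kinds of Spark memory settings involved; the values below are purely illustrative, and in our case the actual fix was provisioning nodes with more RAM:

```python
from pyspark.sql import SparkSession

# Illustrative memory-related settings (values are examples only).
# Driver/executor memory is normally set at cluster creation or job
# submission time rather than inside an already-running session.
spark = (
    SparkSession.builder.appName("uk-wiki-word2vec")
    .config("spark.driver.memory", "24g")        # Java heap for the driver
    .config("spark.executor.memory", "12g")      # Java heap per executor
    .config("spark.driver.maxResultSize", "4g")  # cap on results collected to the driver
    .config("spark.rpc.message.maxSize", "512")  # MiB, raises the RPC message size limit
    .getOrCreate()
)
```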
In a reasonable time, we got six word2vec models, trained on 10, 30, 50, 70, 90, and 110 thousand articles respectively. We found that a vector size of 100 is sufficient for our word2vec embeddings because our dataset is relatively small. The models were flushed to disk for the evaluation step.
There is an existing test set for word embedding evaluation for the Ukrainian language; it contains 23,982 examples. The idea of the evaluation is to check relations between words: for example, the word "king" relates to "queen" as "father" relates to "mother", so given the first three words the task is to find the word "mother". We use the gensim library to find the closest relations.
In our case, we find the top-10 closest words by relation (cosine similarity between embeddings) and check whether the target word is among them; from this we calculate the precision of the embeddings. We took our embeddings trained on datasets of different sizes (10k, 30k, 50k, 70k, 90k, 110k, 150k, 240k, and 300k articles) and compared them with the Word2Vec embeddings of lang-ua.
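A sketch of this check with gensim; the model path and the example words are placeholders rather than entries from the actual test set file:

```python
from gensim.models import KeyedVectors

# Load vectors exported in word2vec text format (the path is a placeholder).
vectors = KeyedVectors.load_word2vec_format("uk_wiki_word2vec.txt")

def analogy_hit(a: str, b: str, c: str, target: str, topn: int = 10) -> bool:
    """True if `target` is among the top-n words closest to vec(b) - vec(a) + vec(c)."""
    candidates = vectors.most_similar(positive=[b, c], negative=[a], topn=topn)
    return target in {word for word, _ in candidates}

# "king" : "queen" = "father" : "mother" (example words, not from the real test set)
print(analogy_hit("король", "королева", "батько", "мати"))
# precision = hits / total over all 23,982 test examples
```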
It turned out that our best embeddings have a precision close to 37%, while the lang-ua embeddings reach 49%. This can be explained by the high variety of texts included in the lang-ua corpus and by test examples which are not well suited to Wikipedia data, although the lang-ua corpus includes a part of Wikipedia articles as well.
There are a few more interesting observations: the result does not increase after 150k articles, and the vocabulary size became larger than that of lang-ua after covering 300k articles.