Final project for SNLP course summer semester 2021
Please look inside final_submission
directory for final version of notebook code on English and Bengali language. Here, we present a succinct summary of the whole project with analysis of results.
With this project, we aim to estimate OOV words using subword representation. To achieve this we train RNN based language model to artificially generate corpus and compute OOV rate on varying sizes of the generated corpora. Estimating OOV words helps improve the performance of the language model. In this work, we achieved a better OOV rate and perplexity score than the baseline for all three levels of granularity with appropriate hyperparameter tuning.
We begin with preparing the data. The given corpus is segmented into sentences and further split into train and test set in an 80:20 ratio. Segmentation helps SentencePiece to act on the sentence level and implement subwords. All punctuation marks are kept intact to retain richness in the generated text. However, Bengali corpus required manual curation like stripping off English strings, country flag symbols, and repeated occurrence of the same punctuation. Also, the lines are segmented into sentences based on punctuation marks like ?
,!
,|
.
Next, subword units are learned with SentencePiece at three different granularity levels, i.e., characters, smaller vocabulary, and a larger vocabulary. These subwords are used to train the RNN language model based on different subword granularity. The goal is also to perform hyper-parameter tuning to improve baseline perplexity scores.
Further, we use the trained RNN based language model to create artificial data for each granularity level.
Finally, we compute the out-of-vocabulary (OOV) rate on the given corpus and compare it with the OOV rate computed after augmenting the training vocabulary.
We observe that with character level granularity, the generated text is segmented on the character level. With granularity set to smaller subwords units closer to characters, the length of segmented subwords is longer and many words are also considered as subwords. E.g. for Bengali, the tokens are combinations of a few characters but not a complete word. E.g, the word পাওয়া
(pronounced as Pā'ōẏā
) is segmented as ▁পা
an ওয়া
. For larger vocabulary granularity levels, subwords are longer. Many words are also segmented into single subwords. We also consistently see that longer words are broken into two or more segments.
Upon inspection of artificially generated text using rnnlm trained model:
- Character-level granularity, there is a structure of the sentence and many generated words are also real. However, the grammar is very bad and some words don't exist in English.
- Small subword granularity, there are more real words. It still lacks grammar and most sentences don't make sense.
- Larger subword vocabulary, more words are real. Overall, more information and richer meaning is delivered.
With our experiments following parameters are chosen based on better perplexity and OOV rate (more preferred). Parameters table
Params\Models | Baseline (fixed) | Character | Smaller Vocab | Larger Vocab |
---|---|---|---|---|
hidden | 40 | 70 | 100 | 140 |
bptt | 3 | 5 | 6 | 1 |
class | #vocab_size | 72 | 650 | 1600 |
Perplexity scores
Model\ Granularity | Characters | Smaller vocabulary | Larger vocabulary |
---|---|---|---|
Baseline | 8.256309 | 68.993461 | 71.323803 |
Tuned | 7.351029 | 67.478303 | 70.26725 |
OOV rate
OOV on given (original) corpus: 0.1073
Model\ Granularity | Characters | Smaller vocabulary | Larger vocabulary |
---|---|---|---|
Baseline | 0.0754 | 0.06797 | 0.07443 |
Tuned | 0.07482 | 0.06679 | 0.07345 |
Upon inspection of artificially generated text using rnnlm trained model:
- Character-level granularity: The generated text is entirely meaningless and grammatically incorrect. The words in themselves are inexistent and carry no meaning.
- Small subword granularity: There seem very few real and meaningful words, but the generated text as in whole carries no meaning.
- Larger subword granularity: Text seems to be the best of the above, although is partially meaningful. It has more correct words and some of which could be attributed to Bangladeshi Bengali dialect.
We observe that with the increasing number of hidden layers and the number of steps to propagate error (bptt), the perplexity declines as well as OOV improves.
Parameters table
Params\Models | Baseline (fixed) | Character | Smaller Vocab | Larger Vocab |
---|---|---|---|---|
hidden | 40 | 120 | 120 | 120 |
bptt | 3 | 4 | 4 | 4 |
class | #vocab_size | 52 | 400 | 3000 |
Perplexity scores
Model\ Granularity | Characters | Smaller vocabulary | Larger vocabulary |
---|---|---|---|
Baseline | 10.026335 | 62.794973 | 376.060040 |
Custom | 7.311287 | 49.141582 | 365.665047 |
OOV rate
OOV on given corpus: 0.157
Model\ Granularity | Characters | Smaller vocabulary | Larger vocabulary |
---|---|---|---|
Baseline | 0.1352 | 0.1248 | 0.1143 |
Custom | 0.1297 | 0.1202 | 0.1133 |
As can be concluded from the table above, the hyper-parameters that outperform the baseline results are 120 hidden layers and 4 bptt. It can be mentioned that with 120 hidden layers, the training time is quite significant. We also experimented with lower hidden layers and bptt, for which the OOV rates are less than those of baseline. However, we believed increasing the hidden layers and bptt would be a better choice for achieving an improved OOV rate and lower perplexity.
We selected the following vocab size for English and Bengali.
Granularity | Characters | Smaller vocabulary | Larger vocabulary |
---|---|---|---|
English | 72 | 650 | 1600 |
Bengali | 52 | 400 | 3000 |
- Table: Comparison of Perplexity and OOV rate vs Vocabulary size for English.
Vocab Size | PPL | OOV |
---|---|---|
72 | 7.351 | 0.07482 |
250 | 38.00 | 0.07169 |
450 | 57.866 | 0.07032 |
650 | 67.478 | 0.06679 |
1600 | 70.267 | 0.07345 |
2000 | 61.796 | 0.07933 |
2500 | 47.102 | 0.07972 |
- Table: Comparison of Perplexity and OOV rate vs Vocabulary size for Bengali.
Vocab Size | PPL | OOV |
---|---|---|
52 | 10.026 | 0.1353 |
80 | 16.489 | 0.1347 |
100 | 21.925 | 0.1331 |
200 | 41.832 | 0.1301 |
400 | 62.795 | 0.1248 |
800 | 124.997 | 0.1203 |
2000 | 278.289 | 0.1154 |
2500 | 326.707 | 0.1147 |
3000 | 376.060 | 0.1143 |
4000 | 448.427 | 0.1127 |
The OOV rate decreases as the size of generated corpus (
Our results differ for English and Bengali in the following ways:
-
For larger vocabulary Bengali is found to have much higher perplexity than English. This may be due to the fact Bengali is a morphologically richer language than English.
-
We also observed that while Bengali showed a uniform increase in perplexity and decrease in OOV rate with increasing vocab size. However, this behavior for English is not consistent. Net perplexity decreased by 8.47 when vocab size increased from 1600 to 2000. While the OOV rate is expected to decrease with increasing vocabulary size, it escalated continuously from 0.0619 to 0.0797 while increasing vocabulary size from 650 to 2500.
-
For all granularity levels, English had a much lower OOV rate than Bengali. This is again can be explained as Bengali is a morphologically rich language.
We analyzed how subwords can be used as efficient way of estimating OOV words and improve the performance of the language model. The results might vary depending on how morphologically rich is the language. We also observed a uniform relation of vocab_size
with perplexity and OOV rate i.e. with an increase in vocabulary size, the perplexity escalated, while the OOV rate dipped slowly. However, this behavior is inconsistent for the English language.
Here are few ways we could improve the results:
- We could try training rnnlm with more hidden layers and bptt but at a cost of increased training complexity.
- We can also search for optimal vocabulary sizes extensively. Some techniques like grid search with small changes of values could give an improvement in model performance.
- better neural network architectures like transformers can also improve the results.