/language_models


Repository report

This is an introductory repo covering different architectures of Language Models (LM), trained and tested on the Penn Treebank. Language Modeling is the task of developing a model that can predict the probability of a sequence of words or tokens given a context or input.

Contents:

A. 3-gram model with Laplace smoothing
B. LSTM model:
    - case I) with learnable embeddings
    - case II) with pretrained embeddings
C. Pre-trained transformer model
D. Results
E. Text Generation & Discussion (to do)
F. Future improvements

The Penn Treebank is downloaded from nltk and the sentences come already tokenized. In our analysis, we consider all tokens in lowercase form, except for the '-LRB-', '-RRB-', '-LSB-', '-RSB-', '-LCB-', '-RCB-' tokens describing parenthesis variants. Numbers and punctuation symbols are also preserved. In addition, a token is mapped to the unknown token, '< unk>', if it appears fewer than 3 times in the training token set. On this basis we construct the vocabulary V, which contains the set of words that the model sees during training; in turn, it is used to replace test tokens not contained in it with '< unk>'. A minimal sketch of this preprocessing is given below. We note that the vocabulary of the 3-gram model is slightly larger than that of the neural models, since for the neural models we also had to carve out a small validation set. The test set is the same for all models.
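
To make this concrete, here is a minimal sketch of the preprocessing described above; the helper names are illustrative (not the repo's exact code) and the train/test split follows the sentence counts given in section A:

```python
from collections import Counter
import nltk

nltk.download("treebank")
from nltk.corpus import treebank

# Parenthesis tokens are kept as-is; everything else is lowercased.
BRACKETS = {'-LRB-', '-RRB-', '-LSB-', '-RSB-', '-LCB-', '-RCB-'}
MIN_COUNT = 3  # tokens seen fewer than 3 times in training become '<unk>'

def normalize(token):
    return token if token in BRACKETS else token.lower()

sentences = [[normalize(t) for t in sent] for sent in treebank.sents()]
train_sents, test_sents = sentences[:3576], sentences[3576:]  # split as in section A

counts = Counter(t for sent in train_sents for t in sent)
vocab = {t for t, c in counts.items() if c >= MIN_COUNT} | {'<unk>'}

def map_unk(sent):
    """Replace tokens outside the vocabulary V with '<unk>'."""
    return [t if t in vocab else '<unk>' for t in sent]
```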

A. 3-gram model with Laplace smoothing

  • Training-Test data: 3576-338 sentences

  • 2-grams (sequences of 2 words): for each tokenized sentence add one '< bos>' token at the beginning and one '< eos>' token at the end. Then extract the resulting 2-grams per sentence.

  • 3-grams (sequences of 3 words): for each tokenized sentence add two '< bos>' tokens at the beginning and two '< eos>' tokens at the end. Then extract the resulting 3-grams per sentence.

  • Calculate 3-gram model with add-1 smoothing:

    The model learns to calculate next-word probabilities given the previous two words, as per the following formula:

        P(w_i | w_i-2, w_i-1) = ( C(w_i-2, w_i-1, w_i) + 1 ) / ( C(w_i-2, w_i-1) + |V| )

    The presence of the 1 in the numerator and of |V| (= vocabulary size) in the denominator ensures that the model does not assign zero probability to any trigram, unseen (i.e. C(w_i-2, w_i-1, w_i) = 0) or not.
  • Test model performance by calculating perplexity over the M test 3-grams (a minimal implementation sketch follows at the end of this section):

        PPL = exp( - (1/M) * Σ_i log P(w_i | w_i-2, w_i-1) )

    In the above formula, 'log' refers to the natural logarithm (base e).
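
Below is a minimal sketch of the add-1 smoothed trigram model and its perplexity. It assumes the train_sents, test_sents, vocab and map_unk objects from the preprocessing sketch above, and follows the padding described in the bullets (two '< bos>' and two '< eos>' tokens per sentence):

```python
import math
from collections import Counter

tri_counts, bi_counts = Counter(), Counter()
for sent in train_sents:
    tokens = ['<bos>', '<bos>'] + map_unk(sent) + ['<eos>', '<eos>']
    for i in range(2, len(tokens)):
        tri_counts[tuple(tokens[i-2:i+1])] += 1   # C(w_i-2, w_i-1, w_i)
        bi_counts[tuple(tokens[i-2:i])] += 1      # C(w_i-2, w_i-1), counted as a context

V = len(vocab)

def trigram_prob(w2, w1, w):
    """P(w | w2, w1) with Laplace (add-1) smoothing."""
    return (tri_counts[(w2, w1, w)] + 1) / (bi_counts[(w2, w1)] + V)

def perplexity(sents):
    """exp of the negative mean natural-log probability over the test 3-grams."""
    log_prob, n = 0.0, 0
    for sent in sents:
        tokens = ['<bos>', '<bos>'] + map_unk(sent) + ['<eos>', '<eos>']
        for i in range(2, len(tokens)):
            log_prob += math.log(trigram_prob(*tokens[i-2:i+1]))
            n += 1
    return math.exp(-log_prob / n)

print(perplexity(test_sents))
```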

B. LSTM model

  • Training-Validation-Test data: 3262-314-338 sentences

  • Embedding layer: In order to feed words into a neural language model we must create their vector representations. This is achieved via an embedding layer, which is placed at the beginning of the neural architecture. This layer takes as input an integer/categorical representation of each word and maps it to a continuous-valued vector of desired length (the embedding_dim hyperparameter). This layer can be either trainable (case I) or pre-trained (case II).

    Regarding case II, we consider the 300d pre-trained 6B-GloVe embeddings, which are kept frozen during training. We note that the embeddings do not contain representations for the '< eos>' and '< unk>' tokens. In our implementation, we assign the mean of all GloVe vectors to the '< eos>' token and a random vector, with values between the GloVe min and max values, to the '< unk>' token. In addition, there are 34 tokens included in the vocabulary of the case I model (size 3259) which do not have a GloVe representation. In order to assign every vocabulary word a GloVe embedding, we replaced these tokens with '< unk>' as well, resulting in a slightly smaller vocabulary (size 3225). This simple approach is one of many available to tackle this issue (see section F as well); an embedding-initialization sketch is given at the end of this section.

  • In order to train this kind of model, we first put all the text tokens into one large input sequence, via their integer representation, and then process it sequentially. To this end, we choose a hyperparameter called sequence_length and map each sequence of sequence_length tokens to the next token. This procedure takes place iteratively, sliding over the token sequence and shifting the target token of interest by one position to the right at each time step.

    At time step t, the loss is determined by the probability the model assigns to the correct next word (which is known, since we know the text). This learning approach is often called teacher forcing. For a sequence of L training tokens, the Cross-Entropy (CE) loss is given by the formula below:

        CE = - (1/L) * Σ_t log P(w_(t+1) | w_1, .., w_t)

    For any step t, due to the recurrence in the calculation of hidden states (i.e. h_(t+1) depends on h_t), the prediction y_(t+1) can only be computed after y_t has been computed. This results in a sequential/serial loss calculation over the time steps.

  • LSTM language model general architecture:

    Due to the nature of the language modelling task, in the LSTM layer below we use only the output of the last time step (a minimal PyTorch sketch of this architecture is given after this list).

    (N,L+1) --> Embedding --> (N,L,E) --> LSTM --> (N,H) --> Classification --> (N,|V|)   
     input        layer       matrix    layer(s)   matrix        layer          matrix
    
       where N: batch size
             L: integer-sequence length used to predict the next token (integer)
             E: embedding dim
             H: hidden dimension size (i.e. units) per LSTM layer
           |V|: vocabulary V size           
    
  • For this kind of model, the perplexity formula introduced in section A can be adjusted accordingly as per the above loss formula, i.e. the test perplexity equals exp(average CE loss over the test set).
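
The embedding-matrix initialization for case II can be sketched as follows. Here `glove` is assumed to be a dict mapping each GloVe word to its 300-d numpy vector (loaded from glove.6B.300d.txt), and `vocab` is the case II vocabulary with the GloVe-less tokens already mapped to '< unk>'; the variable names are illustrative:

```python
import numpy as np
import torch
import torch.nn as nn

EMB_DIM = 300
glove_matrix = np.stack(list(glove.values()))
mean_vec = glove_matrix.mean(axis=0)                     # assigned to '<eos>'
unk_vec = np.random.uniform(glove_matrix.min(),          # assigned to '<unk>': random
                            glove_matrix.max(), EMB_DIM) # values between GloVe min/max

word2idx = {w: i for i, w in enumerate(sorted(vocab))}
emb = np.zeros((len(vocab), EMB_DIM), dtype=np.float32)
for w, i in word2idx.items():
    if w == '<eos>':
        emb[i] = mean_vec
    elif w == '<unk>':
        emb[i] = unk_vec
    else:
        emb[i] = glove[w]

# Frozen embedding layer for the case II model:
embedding = nn.Embedding.from_pretrained(torch.from_numpy(emb), freeze=True)
```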
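And a minimal PyTorch sketch of the sliding-window data construction and the LSTM architecture diagrammed above; the hyperparameter values are placeholders, not the tuned ones from the notebooks:

```python
import torch
import torch.nn as nn

SEQ_LEN, EMB_DIM, HIDDEN = 30, 300, 256   # illustrative values

def make_windows(ids, seq_len):
    """Map each window of seq_len token ids to the token that follows it."""
    inputs = torch.stack([ids[i:i + seq_len] for i in range(len(ids) - seq_len)])
    targets = ids[seq_len:]
    return inputs, targets                 # shapes: (num_windows, L), (num_windows,)

class LSTMLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=EMB_DIM, hidden=HIDDEN, pretrained=None):
        super().__init__()
        if pretrained is not None:                       # case II: frozen GloVe matrix
            self.emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
        else:                                            # case I: learnable embeddings
            self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size)          # classification layer

    def forward(self, x):                                # x: (N, L) integer ids
        out, _ = self.lstm(self.emb(x))                  # (N, L, E) -> (N, L, H)
        return self.fc(out[:, -1, :])                    # last time step -> (N, |V|)

# Training step (teacher forcing): the target is always the true next token.
#   loss = nn.CrossEntropyLoss()(model(x_batch), y_batch)
# Test perplexity is then exp(mean cross-entropy over the test windows).
```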

C. Pre-trained transformer model

  • Training-Validation-Test data: 3262-314-338 sentences

  • We consider a pre-trained 'small' GPT2. During training we keep the embedding and transformer layers frozen and tune the linear 'head' to the needs of the training set (a minimal sketch of this setup is given at the end of this section).

  • Similar to the LSTM model, we create an integer representation of the training tokens, put them into one large input sequence and choose a value for the sequence_length hyperparameter. We train the model by mapping each sequence of length sequence_length to the same sequence shifted one time step into the future. In contrast to recurrent models, this kind of model processes the input sequence w_1,..,w_L (L = sequence_length) in parallel, using the inputs w_1,..,w_k to calculate y_k, for k <= L. This results in L predictions y_1,..,y_L, whose losses are also calculated in parallel.

    As per the formula below, for a sub-sequence w_1,..,w_k the loss is determined by the probability the model assigns to the correct next word w_(k+1), maintaining the teacher forcing approach of recurrent nets:

        CE = - (1/L) * Σ_k log P(w_(k+1) | w_1, .., w_k)

  • For the 'small' GPT2 architecture, one may refer to the original publication.

  • For this kind of model, the perplexity formula introduced in section A can be adjusted accordingly as per the above loss formula, i.e. the test perplexity equals exp(average CE loss over the test set).
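
A minimal sketch of the frozen-GPT2-plus-trainable-head setup, assuming the Hugging Face transformers library. How the inputs and targets are mapped to integer ids is omitted here, and replacing lm_head with a fresh linear layer over the training-set vocabulary (GPT-2 normally ties its head to the input embeddings) is one way to realize the trainable head, not necessarily the repo's exact implementation:

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")      # 'small' GPT2 (~124M parameters)

for p in model.parameters():                         # freeze embeddings + transformer blocks
    p.requires_grad = False

# New classification head over the training-set vocabulary (assumed `vocab`);
# this is the only part updated during training.
model.lm_head = nn.Linear(model.config.n_embd, len(vocab), bias=False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")

# A forward pass over a batch of sequence_length token ids yields logits for
# every position in parallel:
#   logits = model(input_ids).logits                 # shape (N, L, |V|)
```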

D. Results

On the test set of 338 sentences:

Model                           Perplexity   Complexity (trainable parameters)
3-gram with Laplace smoothing   1082.93      -
LSTM w/ learnable embeddings    248.95       2.9M
LSTM w/ GloVe embeddings        195.72       1.9M
GPT2 w/ trainable head          139.07       2.5M

E. Text Generation & Discussion

(to do)

F. Future improvements

  1. As far as the LSTM with pre-trained embeddings is concerned, we will implement a more advanced approach to deal with vocabulary words that do not have a pre-trained representation (e.g. subword embeddings or contextualized word embeddings).
  2. The choice of model hyperparameter values (see 'hyperparameters.txt' in the path 'notebooks/link_to_learned_models.md') is currently based on case-by-case experimentation. One may utilize Bayesian optimization techniques for more thorough tuning, e.g. using ray.tune. Due to limited hardware resources, this task is postponed for now.