/ml-battle

Comparison of different statistical machine learning models vs. neural network based models on Text Classification tasks.


Machine Learning Models' Battle on Text Classification ⚔️

In this project, I fine-tuned different classical machine learning models and neural networks on Text Classification and Topic Modeling tasks. The results of the evaluation are explained in the "reports.pdf" file.
The architectures of the models are illustrated in the figures below:

  1. Latent Semantic Analysis (LSA): The objective of LSA is to reduce the dimensionality of the document representation before classification. The underlying idea is that words with similar meanings tend to occur in similar pieces of text. The model’s architecture is:
[Figure: LSA model architecture]
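A minimal sketch of an LSA classification pipeline with scikit-learn, not taken from the original notebooks; the component count and the `train_texts`/`train_labels` variables are illustrative placeholders:

```python
# LSA sketch: TF-IDF features reduced with truncated SVD, then a linear classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

lsa_clf = make_pipeline(
    TfidfVectorizer(),                # term-document matrix
    TruncatedSVD(n_components=100),   # project into a low-dimensional latent semantic space
    LogisticRegression(max_iter=1000),
)

# `train_texts`/`train_labels` stand in for the actual dataset
lsa_clf.fit(train_texts, train_labels)
```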
  2. Latent Dirichlet Allocation (LDA): In LDA, each document is represented as a mixture of topics, and each topic is a distribution over words. These topics reside in a hidden (latent) layer.
[Figure: LDA model architecture]
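A minimal sketch of how LDA topic mixtures can feed a classifier, again with scikit-learn; the number of topics and the classifier choice are assumptions, not the project's exact settings:

```python
# LDA sketch: each document becomes a distribution over latent topics,
# which is then used as the feature vector for a classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

lda_clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=20, random_state=0),  # 20 latent topics (illustrative)
    LogisticRegression(max_iter=1000),
)
lda_clf.fit(train_texts, train_labels)
```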
  3. Term Frequency-Inverse Document Frequency (TF-IDF) and TF-IDF Character Gram: In this case, TF-IDF is applied at the character level. Instead of counting occurrences of words, occurrences of character n-grams are counted, with the range set to (2, 6), i.e. bigrams, trigrams, 4-grams, 5-grams, and 6-grams.
[Figure: TF-IDF character n-gram model architecture]
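The character-level setup maps directly onto scikit-learn's vectorizer options; a sketch, with the `char_wb` analyzer and the linear SVM as assumed choices:

```python
# Character-level TF-IDF with n-grams from 2 to 6, as described above.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

char_tfidf_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6)),  # character 2- to 6-grams
    LinearSVC(),
)
char_tfidf_clf.fit(train_texts, train_labels)
```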
  4. FastText Model: A very fast model for computing word representations. Each word is represented as a bag of character n-grams in addition to the word itself. For example, for the word “apple” with n = 3, the fastText character n-grams are <ap, app, ppl, ple, le>. This allows the model to capture the meaning of suffixes and prefixes. The n-gram features are averaged to form the hidden layer.
[Figure: FastText model architecture]
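A minimal sketch using the official `fasttext` package for supervised classification; the training file name and hyperparameters are illustrative, and the input format (`__label__<class> <text>` per line) is the library's convention:

```python
# fastText classification sketch: character n-gram subword features averaged
# into a hidden layer, trained directly for classification.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",   # placeholder path; one "__label__topic some text ..." line per document
    wordNgrams=2,        # word bigrams on top of the subword features
    epoch=25,
    lr=0.5,
)

labels, probs = model.predict("example sentence to classify")
```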
  5. BERT Model: Bidirectional Encoder Representations from Transformers (BERT) is based on the self-attention mechanism of the Transformer architecture. Every token in the input sequence attends to every other token, which enables the model to capture context-dependent features while generating word embeddings. In this project, the “bert-base-uncased” model was used in all experiments to extract word embeddings. The embeddings are fed to a linear classifier to predict the classes, with a dropout layer added between the embedding layer and the classifier. The model’s architecture can be visualized as follows:
[Figure: BERT model architecture]
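A sketch of the embeddings-plus-dropout-plus-linear-classifier setup using Hugging Face Transformers and PyTorch; the class name, pooling choice ([CLS] token) and dropout rate are assumptions rather than the project's exact code:

```python
# BERT sketch: "bert-base-uncased" embeddings -> dropout -> linear classifier.
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertClassifier(nn.Module):
    def __init__(self, num_classes, dropout=0.3):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]          # [CLS] token embedding
        return self.classifier(self.dropout(pooled))  # class logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["an example sentence"], padding=True, truncation=True, return_tensors="pt")
logits = BertClassifier(num_classes=4)(batch["input_ids"], batch["attention_mask"])
```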

Combining Models Together

What happens if we combine the embeddings of two different models? Either the hidden features captured by the models will complement each other, or the classifier will just get confused by vectors from different semantic spaces!

One way to find out is hands-on experimentation! Let's combine embeddings from BERT and FastText and evaluate the results. To do so, I applied three different methods:

  1. Concatenate word embeddings (BERT, FastText), sum for the sentence:
    • Compute BERT word embeddings for each word separately.
    • Compute FastText word embeddings for each word separately.
    • Combine the embeddings of each word by concatenating its BERT word embedding with its FastText word embedding.
    • At this stage we have a combined representation for each single word. To compute the sentence embedding, we sum the embeddings of all the words.
[Figure: concatenate word embeddings (BERT, FastText), sum for the sentence]
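A sketch of method 1 in NumPy, assuming two hypothetical helpers, `bert_word_embeddings(sentence)` and `fasttext_word_embeddings(sentence)`, each returning one vector per word (they are not functions from the original project):

```python
import numpy as np

def combined_sentence_embedding_concat(sentence):
    bert_vecs = bert_word_embeddings(sentence)     # hypothetical helper: list of per-word BERT vectors
    ft_vecs = fasttext_word_embeddings(sentence)   # hypothetical helper: list of per-word FastText vectors
    # Concatenate BERT and FastText vectors word by word
    word_vecs = [np.concatenate([b, f]) for b, f in zip(bert_vecs, ft_vecs)]
    # Sum over all words to get a single sentence vector
    return np.sum(word_vecs, axis=0)
```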
  2. Sum word embeddings (BERT, FastText), sum for the sentence:
    • Compute BERT word embeddings for each word separately.
    • Compute FastText word embeddings for each word separately.
    • Combine the embeddings of each word by summing its BERT word embedding with its FastText word embedding.
    • At this stage we have a combined representation for each single word. To compute the sentence embedding, we sum the embeddings of all the words.
[Figure: sum word embeddings (BERT, FastText), sum for the sentence]
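A sketch of method 2 with the same hypothetical helpers; element-wise summation assumes the two embedding spaces share the same dimensionality (e.g. FastText trained with a vector size equal to BERT's hidden size), which is an assumption on my part:

```python
import numpy as np

def combined_sentence_embedding_sum(sentence):
    bert_vecs = bert_word_embeddings(sentence)     # hypothetical helper
    ft_vecs = fasttext_word_embeddings(sentence)   # hypothetical helper; same vector size assumed
    # Sum BERT and FastText vectors word by word
    word_vecs = [b + f for b, f in zip(bert_vecs, ft_vecs)]
    # Sum over all words to get a single sentence vector
    return np.sum(word_vecs, axis=0)
```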
  3. Compute each model's sentence representation separately (sum of word embeddings), then combine the two sentence representations by summation:
    • Compute BERT word embeddings for each word separately.
    • Compute the BERT sentence embedding by summing the BERT word embeddings.
    • Compute FastText word embeddings for each word separately.
    • Compute the FastText sentence embedding by summing the FastText word embeddings.
    • Combine the two sentence embeddings by summing the BERT sentence embedding with the FastText sentence embedding.
[Figure: sum each model's sentence embedding (BERT + FastText)]
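A sketch of method 3, again using the same hypothetical helpers and the same equal-dimensionality assumption: each model's sentence vector is built first, and only then are the two summed:

```python
import numpy as np

def combined_sentence_embedding_late_sum(sentence):
    bert_sentence = np.sum(bert_word_embeddings(sentence), axis=0)     # BERT sentence vector
    ft_sentence = np.sum(fasttext_word_embeddings(sentence), axis=0)   # FastText sentence vector
    # Combine the two sentence-level representations
    return bert_sentence + ft_sentence
```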