Table of Contents

1. Overview

2. Methodology

2.1 Data Preparation and Preprocessing

2.2 Transformation Methods

2.2.1 BOW

2.2.2 TF-IDF

2.2.2.1 n-gram

2.2.3 Doc2Vec

2.2.4 Word2Vec

2.2.5 GloVe

2.2.6 BERT

2.2.7 LDA (Latent Dirichlet Allocation)

2.2.8 FastText

2.3 Training

2.4 Evaluation

2.4.1 K-means

2.4.2 EM

2.4.3 Hierarchical Clustering

2.4.4 Choosing the Champion Model

2.5 Error Analysis

2.5.1 Showing the Most Frequent Words

2.5.2 Cosine Similarity

3. Conclusion

1. Overview

In this project, we selected several books from the Gutenberg library, all from the same category, then drew random paragraphs from them and labeled each paragraph with the name of its book.
After creating the dataset, we used several transformation algorithms to embed the text as numbers for the modeling process (BERT, TF-IDF, BOW, skip-gram, GloVe, LDA, Word2Vec, Doc2Vec).
We then tried many ML algorithms and chose the champion, the one that achieved the highest accuracy.

2. Methodology

2.1 Data Preparation and Preprocessing

At this step, our task is to look at the data and explore its main characteristics: its size and structure (how sentences, paragraphs, and texts are built) and, finally, how much of this data is useful for our needs. We started by reading the data.

  • We used the nltk library to access Gutenberg books and chose the IDs of five different books from the same genre.
  • We displayed the data to see whether it needed cleaning.
  • Then, we cleaned the data of unwanted characters, extra white space, and stop words.
  • We tokenized the data to convert it into words.
  • We converted the cleaned data to lowercase.
  • Then, we lemmatized the words, reducing each one to its base (root) form.
  • We labeled the cleaned data of each book with that book's name.
  • Then, we chunked the cleaned data of each book into 200 partitions of 100 words each. So, we now have a 1000x2 DataFrame (a sketch of this pipeline follows Fig. 1).

Fig. 1: Cleaned data
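The sketch below illustrates the pipeline in Python. It is a minimal illustration, not the project's actual code: the five book IDs, the regex-based cleaning, and the variable names are all assumptions.

    import re
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("gutenberg")
    nltk.download("stopwords")
    nltk.download("wordnet")

    # Five same-genre books; these IDs are illustrative stand-ins.
    BOOKS = ["austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
             "chesterton-ball.txt", "chesterton-brown.txt"]

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    rows = []
    for book_id in BOOKS:
        raw = nltk.corpus.gutenberg.raw(book_id)
        # Lowercase, keep alphabetic tokens only, drop stop words, lemmatize.
        tokens = [lemmatizer.lemmatize(w)
                  for w in re.findall(r"[a-z]+", raw.lower())
                  if w not in stop_words]
        # 200 partitions of 100 words each, labeled with the book name.
        for i in range(200):
            chunk = tokens[i * 100:(i + 1) * 100]
            rows.append({"text": " ".join(chunk), "label": book_id})

    df = pd.DataFrame(rows)  # 5 books x 200 partitions = a 1000x2 DataFrame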

2.2 Transformation Methods

This is one of the essential steps for a better understanding of the context we are dealing with. After the initial text is cleaned and normalized, we need to transform it into features that can be used for modeling.

We used several methods to assign weights to particular words, sentences, or documents within our data before modeling them. We went with numerical representations of individual words, as it is easy for the computer to process numbers.

Before starting to transform words, we split the data into training and testing sets to prevent data leakage.

2.2.1 BOW

  • A bag of words is a representation of text that describes the occurrence of words within a document. We keep track of word counts only and disregard grammatical details and word order.
  • Since the data was already split, we applied BOW to the training and testing data separately.
  • We transformed each sentence into an array of word occurrences in that sentence, as in the sketch below.
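A minimal scikit-learn sketch of this step, assuming the df DataFrame from the preprocessing sketch; fitting the vectorizer on the training split only mirrors the leakage precaution above, and the split size is an assumption.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    # Hold out a test split before fitting any transformer.
    train_texts, test_texts = train_test_split(
        df["text"], test_size=0.2, random_state=42)

    bow = CountVectorizer()
    X_train_bow = bow.fit_transform(train_texts)  # vocabulary learned on train only
    X_test_bow = bow.transform(test_texts)        # reused as-is on test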

2.2.2 TF-IDF

  • TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
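A sketch with scikit-learn's TfidfVectorizer, reusing the splits from the BOW sketch. Note that scikit-learn's default idf is the smoothed ln((1 + N) / (1 + df(t))) + 1 rather than the plain log(N / df(t)).

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tf-idf(t, d) = tf(t, d) * idf(t), fit on the training split only.
    tfidf = TfidfVectorizer()
    X_train_tfidf = tfidf.fit_transform(train_texts)
    X_test_tfidf = tfidf.transform(test_texts)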

2.2.2.1 n-gram

  • We applied unigrams to the training and testing sets. This builds a dictionary with the n-grams as keys and, as values, the list of words that occur after each n-gram; a sketch follows.
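A small sketch of the dictionary described above, with n = 1 (unigrams); the helper name is hypothetical.

    from collections import defaultdict

    def build_ngram_dict(tokens, n=1):
        """Map each n-gram to the list of words observed right after it."""
        following = defaultdict(list)
        for i in range(len(tokens) - n):
            following[tuple(tokens[i:i + n])].append(tokens[i + n])
        return following

    ngrams = build_ngram_dict("the ship sailed and the ship sank".split())
    # ngrams[("ship",)] -> ["sailed", "sank"]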

2.2.3 Doc2Vec

  • Doc2Vec is a method for representing a document as a vector and is built on the word2vec approach.
  • We trained a model from scratch to embed each sentence or paragraph of the data frame as a vector of 50 elements.
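A minimal gensim training sketch producing the 50-element vectors described above; hyperparameters such as epochs and min_count are assumptions.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [TaggedDocument(words=text.split(), tags=[i])
              for i, text in enumerate(train_texts)]

    d2v = Doc2Vec(vector_size=50, min_count=2, epochs=40)
    d2v.build_vocab(tagged)
    d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)

    vec = d2v.infer_vector("the ship sailed at dawn".split())  # 50-element vector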

2.2.4 Word2Vec

  • Word2vec is a method to represent each word as a vector.
  • We used a pre-trained model “word2vec-google-news-300”.
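Loading that pre-trained model through gensim's downloader API is a one-liner; the queries below are only illustrations.

    import gensim.downloader as api

    w2v = api.load("word2vec-google-news-300")  # returns KeyedVectors
    print(w2v["ship"].shape)                    # (300,)
    print(w2v.most_similar("ship", topn=3))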

2.2.5 GloVe

  • Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm for word embedding.
  • We trained a GloVe model on the books' data that represents each word as a 300x1 vector: we took the data frame after cleaning, passed each paragraph to the corpus, and then trained the model on each word.
  • We also used a pre-trained model, "glove-wiki-gigaword-300", in which each word is represented by a 300x1 vector. Then, we replaced each word of each sentence in the data frame with its vector representation, as sketched below.
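A sketch of the pre-trained path: load "glove-wiki-gigaword-300" and replace each in-vocabulary word with its 300-dimensional vector (skipping out-of-vocabulary words here is an assumption).

    import numpy as np
    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-300")

    def sentence_to_vectors(text, kv):
        """One 300-dim row per word of the sentence found in the vocabulary."""
        return np.vstack([kv[w] for w in text.split() if w in kv])

    vectors = sentence_to_vectors("the ship sailed at dawn", glove)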

2.2.6 BERT

  • BERT (Bidirectional Encoder Representations from Transformers) is a highly complex and advanced language model that helps automate language understanding.
  • BERT is the encoder part of the Transformer architecture; it consists of 12 layers in the base model and 24 layers in the large model. So, we can take the output of these layers as embedding vectors from the pre-trained model.
  • There are three approaches to building the embedding vectors: concatenating the last four layers, summing the last four layers, or embedding the full sentence by taking the mean of the embedding vectors of the tokenized words.
  • As the first two methods require more computational power, we used the third one: each token is represented as a 768x1 vector, and taking the mean over the tokens represents the whole sentence as a single 768x1 vector.
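A hedged sketch of the third approach with Hugging Face transformers: mean-pool the 768-dimensional token embeddings of bert-base into a single sentence vector.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def bert_sentence_vector(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, n_tokens, 768)
        return hidden.mean(dim=1).squeeze(0)            # one 768-dim vector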

2.2.7 LDA (Latent Dirichlet Allocation)

  • It is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. LDA is an example of a topic model.
  • We used LDA as a transformer after the vectorization done in BOW, because LDA cannot vectorize words itself; see the sketch at the end of this section.

  • We visualized the results of topic modeling using LDA; they appear as follows:

  • We also measured the coherence per topic of the LDA model:

And the model coherence:
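A minimal gensim sketch of this pipeline (BOW first, then LDA, then coherence); the topic count of 5 matches the five books but is still an assumption about the project's settings.

    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    docs = [text.split() for text in train_texts]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # the BOW step LDA needs

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

    cm = CoherenceModel(model=lda, texts=docs,
                        dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())            # overall model coherence
    print(cm.get_coherence_per_topic())  # coherence per topic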

2.2.8 FastText

  • FastText is a library for learning word embeddings and text classification. The model allows one to build unsupervised or supervised learning algorithms for obtaining vector representations of words.
  • We loaded a pre-trained model from the gensim API, "fasttext-wiki-news-subwords-300", as shown below.
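Loading it follows the same downloader pattern as Word2Vec; the query is only an illustration.

    import gensim.downloader as api

    ft = api.load("fasttext-wiki-news-subwords-300")  # returns KeyedVectors
    print(ft.most_similar("ship", topn=3))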

2.3 Training

After splitting the data, transforming it, and extracting features from it, we applied several clustering algorithms, such as:

  • K-means is an unsupervised machine learning algorithm in which each observation belongs to the cluster with the nearest mean.
  • EM clustering estimates the means and standard deviations of each cluster so as to maximize the likelihood of the observed data.
  • Hierarchical clustering is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters, where each cluster is distinct from the others and the objects within each cluster are broadly similar to each other.

So, we have 24 models across all our transformation methods; a sketch of this training loop follows.
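A sketch of the loop that produces those 24 models, assuming a features dict that maps each transformation method's name to its (dense) feature matrix; the dict itself is a hypothetical name.

    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.mixture import GaussianMixture

    clusterers = {
        "kmeans": KMeans(n_clusters=5, n_init=10, random_state=42),
        "em": GaussianMixture(n_components=5, random_state=42),
        "hierarchical": AgglomerativeClustering(n_clusters=5),
    }

    predictions = {}
    for method, X in features.items():              # 8 transformation methods
        for name, clusterer in clusterers.items():  # 3 clustering algorithms
            predictions[(method, name)] = clusterer.fit_predict(X)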

2.4 Evaluation

After the training phase, we used several metrics to measure the performance of each clustering algorithm with each transformation method:

2.4.1 K-means

K-means is an unsupervised machine learning algorithm in which each observation belongs to the cluster with the nearest mean.

2.4.1.1 Elbow Method

As shown in the figure, the best transformation methods with k=5 are Doc2Vec, TF-IDF, BERT, and LDA.

2.4.1.2 Silhouette Score

As shown in the figure, the best transformation methods by silhouette score when k=5 are Doc2Vec, TF-IDF, and LDA.

Before comparing the predicted clusters with the human labels, we must take care of the mapping between cluster IDs and label names to get the right Kappa score.

This is an example before mapping; one common way to perform the mapping is sketched below.
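The sketch matches clusters to labels with the Hungarian algorithm on the confusion matrix before computing Cohen's Kappa; this is an illustration, not necessarily the project's exact procedure.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import cohen_kappa_score, confusion_matrix

    def mapped_kappa(y_true, y_pred):
        cm = confusion_matrix(y_true, y_pred)
        rows, cols = linear_sum_assignment(-cm)  # maximize diagonal agreement
        mapping = dict(zip(cols, rows))          # cluster ID -> label ID
        y_mapped = np.array([mapping[c] for c in y_pred])
        return cohen_kappa_score(y_true, y_mapped)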


2.4.1.3 Kappa Score

The highest Kappa score is Doc2Vec's, at 99.25%.

2.4.1.4 Visualizing Clusters Using Doc2Vec

Clusters with actual Labels:

Clusters with K-means:

2.4.2 EM

EM clustering is to estimate the means and standard deviations for each cluster to maximize the likelihood of the observed data.

2.4.2.1 Silhouette Score

The highest silhouette score is LDA's, when k=4.

2.4.2.2 BIC Score

The lowest BIC score is FastText's, with k=2.

2.4.2.3 Kappa Score

The highest Kappa score is Doc2Vec's, with 99.6%.

2.4.2.4 Using PCA to visualize the clusters of the best-scoring models

Doc2Vec (Highest Kappa Score):

LDA (Highest Silhouette Score):

FastText (Lowest BIC Score):

Judging by the human labels, the best model is Doc2Vec with k=5.

2.4.3 Hierarchical Clustering

Hierarchical clustering is an algorithm that groups similar objects into groups. The endpoint is a set of clusters, where each cluster is distinct from the other cluster, and the objects within each cluster are broadly like each other.

2.4.3.1 Elbow Method

The majority of the models voted for k =4.

2.4.3.2 Silhouette Score

The majority of the models voted for k =5.

2.4.3.3 Kappa score

The highest Kappa score is Doc2Vec's, with 99.25%.

So, this is the hierarchical champion model.

Dendrogram of the result

2.4.4 Choosing the Champion Model

As Doc2Vec achieved the best scores across all the models, we applied the three clustering algorithms to the Doc2Vec features to choose the champion model.

As shown in the figure, Doc2Vec with EM clustering has the best score among all the clusterers.

2.5 Error Analysis

2.5.1 Showing the Most Frequent Words

  • We plotted the most frequent words in a wrongly clustered sample. The champion model predicted that this sample belongs to cluster 3 while its actual label was 2, and the plot appears as follows:

  • The most frequent words in the actual clusters (2 and 3):

  • We can observe that the most frequent words in the wrongly predicted sample closely match cluster 3, like the word "ship" for example. This can tell us why it was misclustered.
  • We also wanted to confirm this with a concrete metric. So, we took the 10 most frequent words in this sample and counted how frequent these words were in the true book and in the predicted book (see the sketch after this list).

  • We observed that 5 of the most frequent words in the sample are more frequent in the predicted book.
  • When the diff column is negative, the word is more frequent in the predicted label than in the actual label.
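A sketch of this check; the function and column names are hypothetical, but the diff column follows the convention above (negative means more frequent in the predicted book).

    from collections import Counter
    import pandas as pd

    def frequency_diff(sample_tokens, actual_tokens, predicted_tokens, top_n=10):
        top = [w for w, _ in Counter(sample_tokens).most_common(top_n)]
        actual, predicted = Counter(actual_tokens), Counter(predicted_tokens)
        return pd.DataFrame({
            "word": top,
            "actual_count": [actual[w] for w in top],
            "predicted_count": [predicted[w] for w in top],
            "diff": [actual[w] - predicted[w] for w in top],
        })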

2.5.2 Cosine Similarity

  • We calculated the mean vectors of the actual and predicted books (50x1 each), then computed the cosine similarity between these vectors and the wrong samples (a sketch of this computation appears at the end of this section). We put the results into a DataFrame, which appears as follows:

Figure: cosine similarity of correctly predicted samples

  • As shown in the figure, we noticed that the correctly predicted samples have large cosine similarities.
  • Then we printed three random wrong samples to inspect the results.

  • This shows that the similarity between the book representation and the wrongly predicted label is larger than with the actual label, which gives us an intuition about why the machine failed to predict it.
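A sketch of this computation, assuming the Doc2Vec model d2v from section 2.2.3; the 50x1 book representation is the mean of the inferred vectors of the book's paragraphs.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def book_mean_vector(paragraphs, model):
        return np.mean([model.infer_vector(p.split()) for p in paragraphs], axis=0)

    # For a wrong sample, compare its vector with both book representations:
    # sim_actual    = cosine(sample_vec, book_mean_vector(actual_paragraphs, d2v))
    # sim_predicted = cosine(sample_vec, book_mean_vector(predicted_paragraphs, d2v))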

3. Conclusion

  • After cleaning and preprocessing, we used 8 different transformation methods for text clustering. Then, we applied 3 different clustering algorithms to each of these 8 methods. This resulted in 24 models, from which we evaluated which transformation method works best with clustering in our case. As shown in the report, Doc2Vec performed best with all 3 algorithms. After comparing these 3 models on Doc2Vec, we found that EM with Doc2Vec is the champion model. Finally, we performed error analysis using cosine similarity and the most frequent words in the mislabeled documents; the results show that most of the mislabeled documents contain words that are more frequent in the predicted book than in the one matching the actual label.