/nlp-text-classifier

NLP for classifying text. Using Word2Vec word embeddings and a neural net with a bidirectional LSTM to categorize sentences provided by the user 🤔


RUAK - Are you a Hegel? 🧐

The School of Athens
The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
Bertrand Russell

About this project

Philosophy is a fundamental movement of human thought. Everyone is a philosopher. The only question is what kind of philosopher you are. This project tries to answer that question.

Using natural language processing (NLP), texts by different authors are used for categorisation. With the help of these texts, any sentence can be assigned to one of the authors. To understand how written language works and how the authors differ, it helps to analyse the context of the sentences. Through visualisation it is easier to see structural differences such as average sentence length, word class ratios and the use of stop words.

About the notebooks

text_classifier.ipynb

This notebook is the heart of the project. More details about this notebook and how to use it can be found in its introduction, right at the top of the notebook. You can open it in Google Colab to use a GPU and have a nice platform for editing. There you can run it out of the box. No setup needed!
Open In Colab

Content

  1. Preparations
  2. Loading text data
  3. Collect and clean data
  4. Creating DataFrame and extract new data
  5. Store or load DataFrame
  6. Visualization of data
  7. Preparation, splitting and normalization
  8. Hyperparameter tuning
  9. Model preparation and training
  10. Save or load model
  11. Evaluation
  12. TensorBoard

Preparation & Loading text data

These two parts contain the imports, some additional downloads, the handling of the stop words, the loading of the text from the provided files, and the variables (paths, etc.) which must be set.

Collect data and create word collection

Here the texts are divided into individual sentences using NLTK's tokenizer. In addition, each sentence is given a label. Afterwards, sentences that are either too short or too long are removed. The default minimum and maximum lengths are 6 and 400 words respectively.
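
A minimal sketch of this step, assuming plain-text input files (the file name kant.txt and the label value are placeholders) and the default length bounds of 6 and 400 words:

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt")            # tokenizer models used by sent_tokenize

    MIN_WORDS, MAX_WORDS = 6, 400     # default length bounds from the notebook

    def collect_sentences(text, label):
        """Split raw text into labelled sentences and drop length outliers."""
        pairs = []
        for sentence in sent_tokenize(text):
            n_words = len(sentence.split())
            if MIN_WORDS <= n_words <= MAX_WORDS:
                pairs.append((sentence, label))
        return pairs

    with open("kant.txt", encoding="utf-8") as f:   # placeholder file name
        labelled_sentences = collect_sentences(f.read(), label=0)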

Create and extend DataFrame

In this part the Pandas library comes in handy. Its task is to create a data frame and to derive new columns from the existing information (a sketch follows the list below):
  • author a more readable form of the label
  • word_count
  • mean_word_length
  • stop_words_ratio The ratio of stop words to all words
  • stop_words_count
  • If POS tagging is activated another 16 columns are added:
    • ADJ_count adjective count
    • ADV_count adverb count
    • ADP_count adposition count
    • AUX_count auxiliary count
    • DET_count determiner count
    • NUM_count numeral count
    • X_count other count
    • INTJ_count interjection count
    • CONJ_count conjunction count
    • CCONJ_count coordinating conjunction count
    • SCONJ_count subordinating conjunction count
    • PROPN_count proper noun count
    • NOUN_count noun count
    • PRON_count pronoun count
    • PART_count particle count
    • VERB_count verb count
    For more information visit spaCy's API documentation.
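
A hedged sketch of this step. It assumes the labelled_sentences list from the previous snippet, an English spaCy pipeline (en_core_web_sm), NLTK's English stop word list, and an example author mapping; none of these specifics come from the notebook itself:

    import nltk
    import pandas as pd
    import spacy
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    nlp = spacy.load("en_core_web_sm")               # assumed English pipeline
    stop_words = set(stopwords.words("english"))     # assumed stop word source

    df = pd.DataFrame(labelled_sentences, columns=["sentence", "label"])
    df["author"] = df["label"].map({0: "Kant", 1: "Hume"})   # readable form of the label

    words = df["sentence"].str.split()
    df["word_count"] = words.str.len()
    df["mean_word_length"] = words.apply(lambda ws: sum(map(len, ws)) / len(ws))
    df["stop_words_count"] = words.apply(lambda ws: sum(w.lower() in stop_words for w in ws))
    df["stop_words_ratio"] = df["stop_words_count"] / df["word_count"]

    # Optional POS tagging: one count column per universal POS tag
    POS_TAGS = ["ADJ", "ADV", "ADP", "AUX", "DET", "NUM", "X", "INTJ", "CONJ",
                "CCONJ", "SCONJ", "PROPN", "NOUN", "PRON", "PART", "VERB"]

    def pos_counts(sentence):
        doc = nlp(sentence)
        return pd.Series({f"{tag}_count": sum(tok.pos_ == tag for tok in doc)
                          for tag in POS_TAGS})

    df = pd.concat([df, df["sentence"].apply(pos_counts)], axis=1)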

Store or load DataFrame

Since the process of creating the data frame can take quite a while, the data frame can be stored and loaded for later use.
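
For example (the file path is a placeholder):

    df.to_pickle("data/sentences.pkl")         # store the prepared DataFrame
    df = pd.read_pickle("data/sentences.pkl")  # load it again in a later session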

Visualization of data

This part is all about visualizing the data so it can be understood more easily. First the data is prepared for plotting with the Matplotlib library.
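
As an illustration, a chart such as the data distribution per author could be produced roughly like this (column names follow the DataFrame sketch above):

    import matplotlib.pyplot as plt

    shares = df["author"].value_counts()
    plt.figure(figsize=(8, 4))
    plt.bar(shares.index, shares.values)
    plt.title("Data distribution")
    plt.ylabel("Number of sentences")
    plt.show()
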
Data distribution
Shows the shares of data for each author:
Seems like Kant's share is too big 🤭.

Data distribution

and the distribution of word length by author:
Word length distribution

Distribution of sentence length by author:
Why does Hume have so many sentences 🤔?
Sentence length distribution

Comparing authors
Shows the differences between the authors for 4 metrics: Number of sentences, Median sentence length, Unique vocabulary count, Median stop word ratio.
Hume not only has a lot of sentences, but also very long ones 😯.

Comparing authors

Word classes by authors
Presents, for each author, the ratio of each word class to the total number of words used:
Plato's sentences seem different to the others. Probably because most of his texts are debates 🤓.

Word classes by authors

Common words
Gives an overview of the number of sentences containing one of the 20 most common words:
I would have suspected 'reason' in one of the first places 🧐.

Common words

Sentence representation
To understand the structure of the sentences you can visualize them:
Classical Nietzsche 😎

Sentence representation

Prepare and split

This step prepares the data for the TensorFlow model. To process the text data it needs to be tokenized and encoded; Keras preprocessing methods are used for this. texts_to_sequences encodes the text to a sequence of integers.
Each sequence is padded to the longest available sequence using pad_sequences.
The collected metadata (e.g. the number of stop words) is normalized and unused columns are removed; afterwards the two data frames are concatenated.
Then scikit-learn's train_test_split method is used to split the data.
At the end, two sets of train, validation and label arrays are created for the hyperparameter search and for training the model.
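
A sketch of this step under the assumptions of the previous snippets; the metadata column selection, test size and variable names are illustrative:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(df["sentence"])
    sequences = tokenizer.texts_to_sequences(df["sentence"])   # text -> integer sequences
    padded = pad_sequences(sequences)                          # pad to the longest sequence

    meta_cols = ["word_count", "mean_word_length", "stop_words_ratio"]  # example subset
    metadata = MinMaxScaler().fit_transform(df[meta_cols])     # normalize the metadata
    labels = df["label"].to_numpy()

    # Split sentences, metadata and labels consistently
    (x_train, x_val,
     meta_train, meta_val,
     y_train, y_val) = train_test_split(padded, metadata, labels,
                                        test_size=0.2, random_state=42)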

Hyperparameter tuning

Instead of manually searching for the best hyperparameters for the model, this project uses Keras Tuner.
There are two different ways to create the weights for the embedding layer. You may create your own Word2Vec model using the embeddings_trainer.ipynb. For the English language it is also possible to use the weights from the Word2Vec model provided by TensorFlow Hub.

At the beginning of this step the Word2Vec model is loaded, which can either be created with the embeddings_trainer notebook or taken from TensorFlow Hub as described above.
The hypermodel function contains the definition of the model and the ranges for tuning the hyperparameters. The following parameters can be tuned:

  • hp_dense_units - Number of units in dense layers
  • hp_lstm_units - Number of units in LSTM layers
  • hp_dropout - Dropout rate
  • hp_learning_rate - Learning rate parameter for the optimizer
  • hp_adam_epsilon - Epsilon parameter for Adam
Keras Tuner's Hyperband uses the hypermodel to create a tuner with the following parameters:
  • executions_per_trial - Number of models that should be built and fit for each trial for robustness purposes
  • max_epochs - The maximum number of epochs. This number should be slightly bigger than the number of epochs used for fitting
  • hyperband_iterations - The number of times to iterate over the full Hyperband algorithm
The created tuner is then used to search for the best parameters, which are returned by get_best_hyperparameters. A collection of the best models is returned by get_best_models.
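
A hedged sketch of the tuner setup. The parameter ranges and Hyperband settings are illustrative, and build_model uses a deliberately simplified single-input stand-in so the snippet stays short; the actual two-input architecture is described below. tokenizer, df, x_train, etc. refer to the preparation sketch above.

    import keras_tuner as kt
    import tensorflow as tf

    VOCAB_SIZE = len(tokenizer.word_index) + 1
    NUM_CLASSES = df["label"].nunique()

    def build_model(hp):
        # Hyperparameters with example search ranges
        hp_lstm_units = hp.Int("lstm_units", 32, 128, step=32)
        hp_dense_units = hp.Int("dense_units", 32, 256, step=32)
        hp_dropout = hp.Float("dropout", 0.1, 0.5, step=0.1)
        hp_learning_rate = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])
        hp_adam_epsilon = hp.Choice("adam_epsilon", [1e-7, 1e-6])

        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(VOCAB_SIZE, 300),
            tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hp_lstm_units)),
            tf.keras.layers.Dense(hp_dense_units, activation="relu"),
            tf.keras.layers.Dropout(hp_dropout),
            tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate,
                                               epsilon=hp_adam_epsilon),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        return model

    tuner = kt.Hyperband(build_model,
                         objective="val_accuracy",
                         max_epochs=30,              # slightly above the training epochs
                         hyperband_iterations=2,
                         executions_per_trial=2,
                         directory="tuning",
                         project_name="ruak")
    tuner.search(x_train, y_train, validation_data=(x_val, y_val))

    best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
    best_models = tuner.get_best_models(num_models=3)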

This image shows a possible model found by the Keras tuner search:

Model structure


The model has two inputs. One passes the encoded and padded sentences to the embedding layer; the other handles the generated metadata. The two branches are later concatenated before the model ends with a Dense layer whose number of units equals the number of available classes (authors).
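
A possible functional definition of this two-input architecture; the layer sizes are illustrative, the pre-trained Word2Vec weights would be passed to the Embedding layer (only indicated here), and VOCAB_SIZE/NUM_CLASSES come from the tuner sketch above:

    import tensorflow as tf

    SEQ_LEN = padded.shape[1]             # padded sentence length from the split step
    META_FEATURES = metadata.shape[1]     # number of metadata columns

    text_input = tf.keras.Input(shape=(SEQ_LEN,), name="sentences")
    meta_input = tf.keras.Input(shape=(META_FEATURES,), name="metadata")

    # Embedding + bidirectional LSTM branch for the encoded sentences
    x = tf.keras.layers.Embedding(VOCAB_SIZE, 300)(text_input)   # Word2Vec weights could be set here
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

    # Concatenate with the metadata branch and classify
    merged = tf.keras.layers.concatenate([x, meta_input])
    merged = tf.keras.layers.Dense(128, activation="relu")(merged)
    output = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(merged)

    model = tf.keras.Model(inputs=[text_input, meta_input], outputs=output)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])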

Model preparation and training

The selected model is trained with its fit method using the training and validation data (a sketch follows the list below). Three different callbacks are used:
  • Tensorboard - For collecting the data for presentation in TensorBoard
  • ReduceLROnPlateau - Reduce learning rate when a metric has stopped improving
  • EarlyStopping - Stops the training if there is no learning progress
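
A sketch of the training call, using the model and data from the snippets above; the monitor settings, patience values, epochs and log directory are assumptions:

    callbacks = [
        tf.keras.callbacks.TensorBoard(log_dir="logs"),
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                         restore_best_weights=True),
    ]

    history = model.fit([x_train, meta_train], y_train,
                        validation_data=([x_val, meta_val], y_val),
                        epochs=25, batch_size=64,
                        callbacks=callbacks)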

Save or load model

Here the model can be stored for later use, or loaded if it is to be used in the next steps.
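
For example (the file path is a placeholder):

    model.save("models/ruak_model.keras")                          # store for later use
    model = tf.keras.models.load_model("models/ruak_model.keras")  # reload in a later run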

Evaluation

Draw charts to compare training and validation results, and try custom sentences to be classified by the trained model.
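
A possible evaluation snippet under the assumptions of the earlier sketches; the custom sentence and the zero metadata vector are placeholders:

    import numpy as np
    import matplotlib.pyplot as plt
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Compare training and validation accuracy over the epochs
    plt.plot(history.history["accuracy"], label="training")
    plt.plot(history.history["val_accuracy"], label="validation")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()

    # Classify a custom sentence
    sentence = "The thing in itself is not knowable through experience."
    seq = pad_sequences(tokenizer.texts_to_sequences([sentence]), maxlen=SEQ_LEN)
    meta = np.zeros((1, META_FEATURES))              # placeholder metadata
    prediction = model.predict([seq, meta])
    print("Predicted class index:", prediction.argmax())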

TensorBoard

Open TensorBoard to get a detailed overview of the training process.
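
In Colab or Jupyter the dashboard can be opened inline; the log directory matches the TensorBoard callback sketched above:

    %load_ext tensorboard
    %tensorboard --logdir logs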

embeddings_trainer.ipynb

The embeddings_trainer notebook contains a collection of functions to train Word2Vec, Doc2Vec and FastText models. After some tests, the outcome was that the Word2Vec embedding model works best for this case.
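
A hedged sketch of training a custom Word2Vec model with gensim; the corpus source, vector size and output path are illustrative:

    from gensim.models import Word2Vec

    tokenized_corpus = [s.lower().split() for s in df["sentence"]]
    w2v = Word2Vec(sentences=tokenized_corpus, vector_size=300,
                   window=5, min_count=2, workers=4)
    w2v.save("models/ruak_word2vec.model")
    print(w2v.wv.most_similar("reason", topn=5))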

About the crawler

The ruakspider folder contains everything needed to crawl certain websites and collect text, both as training data for the classification model and for training the Word2Vec embedding model.