NLPzoo

A collection of the most popular Natural Language Processing algorithms, frameworks and applications (inspired by tensorlayer/RLzoo).

Primary language: Jupyter Notebook. License: Apache-2.0.

Structure

ML (Machine Learning)

  • This folder contains vanilla machine learning models classifying sample text data. ml_models.py contains the main script used to compare the accuracy of the different models. Models include Naive Bayes and Support Vector Machines. Different vectorizers are also paired with each type of model; so far this has not made a significant difference to a given model's accuracy scores.
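The comparison described above can be sketched with scikit-learn. This is a minimal illustration, not the repo's ml_models.py: the toy texts, labels, and the specific vectorizer/classifier pairings are assumptions.

```python
# Minimal sketch: comparing Naive Bayes and SVM text classifiers with
# different vectorizers, in the spirit of the ML folder's comparison script.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in for the sample text data; label 1 = disaster, 0 = not.
texts = [
    "Forest fire spreading near the highway", "Evacuation ordered after flood",
    "Earthquake shakes the city centre", "Storm damage reported downtown",
    "My mixtape is fire", "This traffic is a disaster lol",
    "That exam destroyed me", "The party was a blast",
] * 5  # repeat so 5-fold cross-validation has enough samples per class
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

results = {}
for vec_name, Vec in [("count", CountVectorizer), ("tfidf", TfidfVectorizer)]:
    for clf_name, Clf in [("nb", MultinomialNB), ("svm", LinearSVC)]:
        pipe = make_pipeline(Vec(), Clf())
        # Mean cross-validated accuracy for this vectorizer/model pairing.
        results[f"{vec_name}+{clf_name}"] = cross_val_score(
            pipe, texts, labels, cv=5).mean()

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Swapping the vectorizer inside the pipeline is all it takes to test whether term counts or TF-IDF weighting changes a given model's accuracy.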

RNN (Recurrent Neural Network)

  • RNNs and flavours of RNNs were the mainstay approach for most natural language processing tasks, including language modelling, named entity recognition, part-of-speech tagging and sentence classification. From 2013 to 2015, Long Short-Term Memory (LSTM) models became the dominant approach. They have since been superseded by RNNs with attention, and by CNNs.
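The defining feature of all these RNN variants is the recurrence: a hidden state carries context from one token to the next. A vanilla-RNN forward pass can be sketched in a few lines of NumPy (this is an illustration of the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), not code from this repo; sizes and weights are arbitrary):

```python
# Illustrative sketch of the vanilla RNN recurrence over a toy sequence.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8  # assumed toy dimensions

W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))  # input weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
b = np.zeros(hidden_size)

def rnn_forward(token_ids):
    """Apply the recurrence over a sequence of token ids; return final state."""
    h = np.zeros(hidden_size)
    for t in token_ids:
        x = np.zeros(vocab_size)
        x[t] = 1.0                             # one-hot encode the token
        h = np.tanh(W_xh @ x + W_hh @ h + b)   # hidden state carries context
    return h

h_final = rnn_forward([0, 3, 1, 4])
print(h_final.shape)
```

LSTMs and attention-based variants replace this simple tanh update with gated updates that preserve information over longer spans, which is what made them dominant for the tasks listed above.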

Data (not present here)

  • disaster-tweets.csv contains tweets about real disasters and exaggerated 'fake' tweets which are not directly related to any disaster. This dataset is part of the "Real or Not? NLP with Disaster Tweets" Kaggle competition. Download here

  • shakespeare.txt contains the complete texts of William Shakespeare. Download here

  • airline-tweets.csv has tweets from 2015 from travellers expressing their feelings on their flying experience. Download here

  • cornell-movie-dialogs-corpus is a classic Natural Language Processing training dataset. It contains 220,579 conversational exchanges between movie characters. Download here

  • ubuntu-dialogs-corpus contains dialogs taken from online chat forums on the topic of Ubuntu. Download here
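Once downloaded, the CSV datasets above load directly with pandas. A minimal sketch for disaster-tweets.csv follows; the column names (id, text, target) follow the usual Kaggle competition layout and are an assumption, and an in-memory sample stands in for the real file:

```python
# Sketch of loading disaster-tweets.csv; a small in-memory CSV stands in
# for the downloaded file, with an assumed (id, text, target) schema.
import io
import pandas as pd

sample_csv = io.StringIO(
    "id,text,target\n"
    "1,Forest fire near the town. Evacuate now,1\n"
    "2,I love fruits,0\n"
)
# With the real file: df = pd.read_csv("disaster-tweets.csv")
df = pd.read_csv(sample_csv)
print(df["target"].value_counts().to_dict())  # quick class-balance check
```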

TODO

The plan for the coming weeks is as follows.

  • Add more models to ML folder.
  • Include Deep Learning models for different varieties of neural networks.
  • Demonstrate and implement a baseline LSTM model to compare with BERT.
  • Demonstrate and implement the BERT language model.
  • Other models (order to be decided later): UniLM, MASS, BART.

Sources