
Text-Summarization-App

Introduction

  • Abstractive summarization is a technique for creating a summary of a text based on its primary ideas rather than by directly reproducing its most important sentences. It is a crucial and difficult task in natural language processing. For this project, the Text-to-Text Transfer Transformer, popularly known as T5, was fine-tuned on our custom dataset so that it can produce abstractive summaries of text data.
  • T5 was chosen because it is flexible enough to be fine-tuned for quite a number of important tasks, especially abstractive summarization, and it has achieved state-of-the-art results in this field.
  • The framework used for this project is PyTorch Lightning, chosen for its speed, efficiency, and reproducibility.

This project aims to summarize long texts of up to 512 tokens into summaries of at most 128 tokens, without reproducing the words of the source text verbatim and while retaining its context. The project was deployed on Hugging Face Spaces with Streamlit, and this repo also contains a Flask app which can be set up locally.

Repository Structure

Project files/folders:

  • Static: This folder contains the CSS file for the UI of the Flask app.
  • Template: This folder contains the HTML files for the home and predict pages of the Flask app.
  • T5 transformers.ipynb: This is the Google Colab notebook used for preprocessing the data and fine-tuning the model.
  • App.py: This is the Streamlit file that serves as the UI for the model deployed on Hugging Face Spaces (a minimal sketch follows this list).
  • Main.py: This is the Flask file created for deploying the model on cloud platforms.
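
As a hedged illustration of the deployment side, App.py might look roughly like the sketch below. This is not the exact file in the repo; the model directory, widget labels, and generation parameters are assumptions.

```python
# Minimal Streamlit UI sketch for the fine-tuned T5 summarizer (assumed layout)
import streamlit as st
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_DIR = "model"  # hypothetical path to the fine-tuned model files

@st.cache_resource
def load_model():
    tokenizer = T5Tokenizer.from_pretrained(MODEL_DIR)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_DIR)
    return tokenizer, model

st.title("Text Summarization App")
text = st.text_area("Paste the text to summarize")

if st.button("Summarize") and text:
    tokenizer, model = load_model()
    enc = tokenizer("summarize: " + text, max_length=512,
                    truncation=True, return_tensors="pt")
    out = model.generate(**enc, max_length=128, num_beams=2)
    st.write(tokenizer.decode(out[0], skip_special_tokens=True))
```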

Dataset

This project started with finding a dataset that could work with the T5 model, since the model takes in text and returns text (one of the main reasons it is flexible). Data preprocessing and exploratory data analysis were carried out on the dataset before tokenizing the data with the model's tokenizer. ##Add graph photo here##
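
As a hedged illustration, the exploratory step might include a quick check that the texts fit the 512-token input budget and the summaries the 128-token budget. The file name and column names below are assumptions, not the actual dataset:

```python
import pandas as pd

# File name, encoding, and column names are assumptions for illustration
df = pd.read_csv("news_summary.csv", encoding="latin-1").dropna()

# Rough length estimate via whitespace splitting, as a cheap proxy for token counts
print(df["text"].str.split().str.len().describe())
print(df["summary"].str.split().str.len().describe())
```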

PyTorch Lightning is the framework used in this project, mainly because it helps organize our PyTorch code. Some of the classes and functions created with PyTorch Lightning include (see the sketch after this list):

  • NewsSummaryDataset: used to tokenize and encode the dataset.
  • NewsSummaryDataModule: wraps the output of NewsSummaryDataset and loads it into dataloaders.
  • NewsSummaryModel: this is where the version of T5 to be used (t5-base) is specified and downloaded, and where the model is fine-tuned on the custom dataset.
  • Summary: this function generates a summary of a piece of text supplied by the user; it is the function used along with the fine-tuned model for deployment.
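
A minimal sketch of how a class like NewsSummaryDataset could be written; the column names, maximum lengths, and return keys are assumptions based on the description above, not the exact notebook code:

```python
import torch
from torch.utils.data import Dataset

class NewsSummaryDataset(Dataset):
    """Tokenizes and encodes (text, summary) pairs for T5."""

    def __init__(self, data, tokenizer, text_max_len=512, summary_max_len=128):
        self.data = data            # pandas DataFrame with "text"/"summary" columns (assumed)
        self.tokenizer = tokenizer  # e.g. T5Tokenizer.from_pretrained("t5-base")
        self.text_max_len = text_max_len
        self.summary_max_len = summary_max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text_enc = self.tokenizer(
            row["text"], max_length=self.text_max_len,
            padding="max_length", truncation=True, return_tensors="pt")
        summary_enc = self.tokenizer(
            row["summary"], max_length=self.summary_max_len,
            padding="max_length", truncation=True, return_tensors="pt")

        labels = summary_enc["input_ids"]
        labels[labels == 0] = -100  # T5's pad id is 0; -100 is ignored by the loss

        return dict(
            input_ids=text_enc["input_ids"].flatten(),
            attention_mask=text_enc["attention_mask"].flatten(),
            labels=labels.flatten(),
        )
```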

Project Walkthrough

This section gives a walkthrough of the model training and the deployment aspects of the project.

Text Preprocessing and Model Building

  • Setup and library imports: This is the first phase of any machine learning project. At this point we switch to a GPU runtime (if necessary) and install and import the libraries needed to get the project started (e.g. numpy, torch, pytorch-lightning). Some of these libraries can be installed/imported later in the project, but I prefer to install mine at the beginning of my ipynb notebook.
  • Getting the dataset from Kaggle to Google Colab: Instead of downloading the dataset manually, this was done using the kaggle library and the dataset's API command on Kaggle (see the first sketch after this list).
  • Loading the dataset into a data frame.
  • Examining the data frame: The data frame consists of 6 columns, 4 of which are unnecessary for this task. A new data frame containing just the text and summary columns was created.
  • Data cleaning: The column names were changed to relevant names, and rows containing null data were dropped, as these rows are few and filling them would be hard.
  • Splitting the dataset into train and test data: The new data frame was split into train and test sets with a test_size of 0.1.
  • NewsSummaryDataset: This class was created to encode, tokenize, pad, and truncate the dataset. This is done with the T5 tokenizer from the installed T5 model. The class also specifies the maximum token length for both the text column and the summary column.
  • NewsSummaryDataModule: This class applies the class above to our dataset and takes into account the train/test split done earlier.
  • NewsSummaryModel: This class contains the functions that train and test the model.
  • A checkpoint was created for the model, and a trainer created from our PyTorch Lightning module was used to fit the model to the data (see the second sketch after this list).
  • A function summarize was created to test the built model. It takes in the text to be summarized and returns a summary of length <= 128 tokens (see the third sketch after this list).
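
First sketch: fetching the dataset with the Kaggle API and preparing the data frame. The dataset slug, file name, and column names are placeholders, since the README doesn't name the exact dataset:

```python
# Colab cells: the Kaggle CLI needs your kaggle.json credentials in ~/.kaggle
!pip install -q kaggle
!kaggle datasets download -d <owner>/<dataset> --unzip    # placeholder slug

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("news_summary.csv", encoding="latin-1")  # file name is an assumption
df = df[["full_text_col", "summary_col"]].dropna()        # keep the 2 useful columns; names assumed
df.columns = ["text", "summary"]                          # rename to relevant names
train_df, test_df = train_test_split(df, test_size=0.1)
```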
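
Second sketch: the checkpoint and trainer step. Hyperparameters such as the epoch count are assumptions, and the sketch assumes the LightningModule logs a "val_loss" metric:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    filename="best-checkpoint",
    save_top_k=1,
    monitor="val_loss",
    mode="min",
)

trainer = pl.Trainer(
    callbacks=[checkpoint_callback],
    max_epochs=3,              # assumption; the README doesn't state the epoch count
    accelerator="gpu",
    devices=1,
)
trainer.fit(model, datamodule=data_module)  # NewsSummaryModel + NewsSummaryDataModule instances
```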
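
Third sketch: a summarize function along these lines. The generation parameters (beam count, early stopping) are assumptions:

```python
def summarize(text, model, tokenizer):
    # T5 expects a task prefix; inputs are truncated to the 512-token budget
    enc = tokenizer(
        "summarize: " + text,
        max_length=512,
        truncation=True,
        return_tensors="pt",
    )
    generated = model.generate(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
        max_length=128,        # summary capped at 128 tokens, as described above
        num_beams=2,           # assumption
        early_stopping=True,
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```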

Limitations

The t5-base model is very large and can't be deployed on free cloud platforms because of its size.

Blockers

I found it hard to build my Flask app with VS Code; I am not sure why, but I suspect dependencies were a major issue. Setting it up in PyCharm was less challenging, and I had to install Rust, Cython, and one or two other dependencies in the Git Bash terminal.

References

  • https://huggingface.co/docs/transformers/model_doc/t5
  • https://www.youtube.com/watch?v=KMyZUIraHio
  • https://www.sabrepc.com/blog/Deep-Learning-and-AI/why-use-pytorch-lightning
  • https://www.pytorchlightning.ai/