
Summary Source Prediction

The goal of this project is to study and apply machine learning/artificial intelligence techniques to predict whether a summary was written by a human or generated by a machine. We evaluate our methods on a collection of summaries. The dataset contains the original documents and reference summaries written by humans, in addition to machine-generated summaries produced by transformer-based seq2seq models. Finally, additional documents are provided for use (for example, to create embeddings).

The data set will eventually be publicly available on the Data Science and Mining Team (DaSciM) website.

Please refer to the following sections for more information about the package usage:

  1. Our results
  2. Installation
  3. Description
  4. Usage via command lines
  5. Documentation

Our results

A brief summary of our results is available in our report under report/report.pdf. Below, we only give a summary table of the test accuracy of different models.

| Model         | Test accuracy | Features |
|---------------|---------------|----------|
| Random Forest | 0.82062       | base regex |
| Stacking      | 0.89937       | base regex + tf-idf (cos, pca) + space_before_ponct_count |
| Stacking      | 0.90625       | base regex + tf-idf (cos, pca) + space_before_ponct_count + PoS-tagging |
| LightGBM      | 0.91312       | base regex + tf-idf (cos, pca, lda) + space_before_ponct_count + PoS-tagging |
| LightGBM      | 0.91875       | base regex + tf-idf (cos, pca, lda) + space_before_ponct_count + PoS-tagging + GLTR |
| CatBoost      | 0.92812       | feature selection |

Installation instructions

In order to use our package and run your own experiments, we advise you to set up a virtual environment. The package has been tested under Python 3.7.12. You will also need the virtualenv package:

pip3 install virtualenv

Then, you can create a virtual environment and switch to it with the following commands:

python3 -m venv myvenv
source myvenv/bin/activate       # Linux
myvenv/Scripts/Activate.ps1      # Windows PowerShell

All the required packages are listed in the requirements file; you can install them with:

pip3 install -r requirements.txt

The requirements file assumes PyTorch 1.11 with CUDA >= 11.3 is installed on your machine. If that is not the case, either install this version via the command line, or install your preferred version locally, remove the torch-related lines from requirements.txt, and run the command again.
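If you do need the matching build, it could at the time of PyTorch 1.11 be installed with a command along these lines (check the official PyTorch installation page for the exact wheel index and package pins):

pip3 install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113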

Package description

Below, we give a brief tree view of our package.

.
├── doc  # contains a generated documentation of src/ in html
├── report  # contains our complete report in pdf format
|   └── figures
├── src  # source code
|   ├── engine
|   |   ├── models
|   |   |   ├── __init__.py
|   |   |   ├── base.py  # scikit-learn compatible classifiers and manual stacking
|   |   |   └── deep.py  # feed-forward and lstm networks w. embedding support
|   |   ├── __init__.py
|   |   ├── gridsearch.py
|   |   ├── hub.py  # to prepare data and create models
|   |   └── training.py 
|   ├── preprocessing
|   |   ├── features  # multiple files for each type of features
|   |   ├── reduction  # multiple files for feature selection
|   |   ├── __init__.py
|   |   └── reader.py  # to read preprocessed files
|   ├── utils 
|   ├── __init__.py
|   ├── data_cleaning.py  # simple function to clean texts and convert to csv
|   ├── data_preparation.py  # main file to compute features
|   └── main.py  # main file to run gridsearch
├── README.md
├── model_selection.ipynb  # selection of features and models
├── model_finetuning.ipynb  # finetuning of models from huggingface database
├── embedded_models.ipynb  # LSTM models based on word embeddings
└── requirements.txt  # contains the necessary Python packages to run our files

Package usage

Downloading the data

The data set will eventually be publicly available on the Data Science and Mining Team (DaSciM) website. Once it is, place the train_set.json, test_set.json and documents.json files in a data/ folder.
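As a minimal sketch, you can load the files with the standard json module to inspect their structure (the exact schema depends on the released data set):

import json

# Quick look at the released files; the schema is not documented here,
# so we only inspect the top-level structure.
with open("data/train_set.json", encoding="utf-8") as f:
    train_set = json.load(f)

print(type(train_set))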

If you want to use word embeddings, this package expects you to have downloaded either GloVe vectors or Google News vectors and placed them in a data/embed/ folder. Otherwise, you can also train your own embeddings using the provided documents; a sketch of both options follows.
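As a hedged sketch (the file names and the use of gensim are assumptions for illustration, not part of this package), loading GloVe vectors or training your own embeddings could look like:

import numpy as np
from gensim.models import Word2Vec

# Load pre-trained GloVe vectors; the file name depends on the vectors
# you downloaded (glove.6B.100d.txt is only an example).
glove = {}
with open("data/embed/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        glove[word] = np.asarray(values, dtype=np.float32)

# Or train embeddings yourself; the corpus below is a placeholder (in
# practice, use the texts from documents.json) and the whitespace
# tokenization is purely illustrative.
documents = ["the quick brown fox", "machine generated summaries differ subtly"]
sentences = [doc.lower().split() for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("data/embed/custom_embeddings.txt")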

Notebooks

In order to use the notebooks, you will also need to install jupyter:

pip3 install jupyter notebook ipykernel
ipython kernel install --user --name=myvenv

There are three available notebooks:

  • model_selection.ipynb: this notebook lets you test different machine learning models and subsets of features
  • model_finetuning.ipynb: this notebook lets you finetune models from the Hugging Face hub, which can achieve great results!
  • embedded_models.ipynb: this notebook lets you try out deep models with word embeddings

Feature engineering

You can use src/data_preparation.py to create a data set of features for all base machine learning models:

python3 src/data_preparation.py [options]
  • --seed: Seed to use everywhere for reproducibility. Default: 42.

  • --regex-feat: List of all regex features to compute. The name of a regex feature should be of the form "A_B". A is the name of the regex expression, for example "upper_word". B is the type of feature we want: "count" for the number of instances, "avg" for the average length of instances, "overlap" for the number of instances found in both the summary and the original document, or "ratio" for the ratio between the number of instances found in the summary and the number found in the document. Example: --regex-feat char_count group_overlap (see the sketch after this option list).

  • --idf-feat: List of all tf-idf feature names to compute. The name of an idf feature is expected to be of the form "A_B". A is a sequence of characters for the composition performed in src.preprocessing.features.tfidf.idf_composition(). B is the type of feature we want: "count" for the term frequencies, "lda" for latent Dirichlet allocation, "idf" for idf features after a PCA transformation that keeps svd_components components from the initial counter features, or "cos" for the cosine similarity between the summary and the original document.

  • --idf-d: Parameter value for "d" composition in tf-idf. Default: 0.1.

  • --idf-b: Parameter value for "b" composition in tf-idf. Default: 0.5.

  • --idf-k: Parameter value for "k" composition in tf-idf. Default: 1.0.

  • --idf-svd: Number of components for the truncated SVD component analysis in tf-idf. Default: 1.

  • --tag-feat: List of all tagging features to compute. The name of a tagging feature must be of the form "A_B". A is a tag from all available tags in nltk, or "tags". B is the type of feature we want: "count" for the number of instances, "avg" for the average length of instances, "overlap" for the number of instances found in both the summary and the original document, "ratio" for the ratio between the number of instances found in the summary and the number found in the document, or "C_D" where C and D are parameters for idf features.

  • --gltr-feat: List of all GLTR features to compute. The name of a GLTR feature should be of the form "A_B". A is the name of a topk value computed by GLTR, it can be "count" or "frac". B is the number of bins to compute the feature.

  • --embed-feat: List of all embed features to compute. The name of an embed feature should be of the form "A". A is the name of an embedding, either "glove" or "google"; any other name will cause a new embedding to be trained.

  • --fixed-poly-feat: List of specific polynomial features. A fixed polynomial feature must be of the form "A B C". A, B and C can be features or powers of features, and their product will be computed. Example: --fixed-poly-feat "char_count^2 group_overlap".

  • --poly-feat: List of features for which polynomial interaction terms will be computed.

  • --all-poly-feat: Use this option to activate polynomial interaction of all features. Use with caution. Default: Deactivated.

  • --poly-degree: Define the degree until which products and powers of features are computed. If 1 or less, there will be no polynomial features. Default: 2.

  • --excl-feat: List of features names to drop after computation. Example: --excl-feat "char_count^2 group_overlap".

  • --max-correlation: Correlation threshold to select features. Default: 1.0.

  • --rescale-data / --no-rescale-data: Use these options to activate or deactivate rescaling of the data sets. Default: Activated.

  • --scaling-method: If "standard", features are rescaled with zero mean and unit variance. If "positive", features are rescaled between zero and one. Default: "standard".

  • --pca-ratio: Variance ratio parameter for the Principal Component Analysis. Default: 1.0.

  • --save-data / --no-save-data: Use these options to activate or deactivate saving the data sets. Default: Activated.

  • --file-suffix: Suffix to append to the training and test files if save_data is True. Default: "final".
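Putting a few of these options together, a typical call could look like the following (the regex feature names come from the example above; the idf feature name "d_cos" is illustrative and must correspond to a composition accepted by src.preprocessing.features.tfidf.idf_composition()):

python3 src/data_preparation.py --seed 42 --regex-feat char_count group_overlap --idf-feat d_cos --file-suffix final

To give an idea of what a regex feature of the form "A_B" computes, here is a minimal, hypothetical sketch; the actual implementation lives in src/preprocessing/features/:

import re

# Hypothetical regex "upper_word" with feature types "count" and "ratio".
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")

def regex_features(summary: str, document: str) -> dict:
    in_summary = UPPER_WORD.findall(summary)
    in_document = UPPER_WORD.findall(document)
    return {
        # "count": number of instances found in the summary
        "upper_word_count": len(in_summary),
        # "ratio": instances in the summary over instances in the document
        "upper_word_ratio": len(in_summary) / max(len(in_document), 1),
    }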

To use embedded models such as LSTMs, you will need to use src/data_cleaning.py (an example call follows the option list):

python3 src/data_cleaning.py [options]
  • --clean-text / --no-clean-text: Use these options to activate or deactivate a slight cleaning of the data, that is, removing extra white spaces. Default: Activated.

  • --save-data / --no-save-data: Use these options to activate or deactivate saving the data sets. Default: Activated.

  • --file-suffix: Suffix to append to the training and test files if save_data is True. Default: "final".
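For instance, to clean the texts and write the resulting files with the default suffix:

python3 src/data_cleaning.py --clean-text --file-suffix final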

Gridsearch

Then, you can use the src/main.py file to run gridsearches over several models (an example call follows the option list). The command is as follows:

python3 src/main.py [options]
  • --seed: Seed to use everywhere for reproducibility. Default: 42.

  • --models-names: Choose models names. Available models: "rfc", "xgboost", "lightgbm", "catboost", "mlp", "logreg", "etc", "stacking" and "embed_lstm".

  • --data-path: Path to the directory where the data is stored. Default: "data/".

  • --file-suffix: Suffix to append to the training and test files. Default: "final".

  • --trials: Choose the number of gridsearch trials. Default: 25.

  • --submission / --no-submission: Use these options to activate or deactivate submitting a file. Default: Activated.

  • --metric: Evaluation metric for the parameters gridsearch. Available metrics: "accuracy" and "f1_weighted". Default: "accuracy".
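For example, a 25-trial gridsearch over two boosting models:

python3 src/main.py --seed 42 --models-names lightgbm catboost --trials 25 --metric accuracy --file-suffix final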

Documentation

Complete documentation is available in the doc/src/ folder. If it has not been generated yet, you can generate it from the root folder with:

python3 -m pdoc -o doc/ --html --config latex_math=True --force src/

Then, open doc/src/index.html in your browser and follow the guide!