Hierarchical Attention Network (HAN) for document classification, implemented in PyTorch, with the option to replace the traditional BiLSTM sentence encoder with a BERT model.
This repository implements the article Hierarchical Attention Networks for Document Classification (Yang et al.) [1]. The sentence encoder is configurable: the embedding of each sentence can be created either by a traditional BiLSTM or by BERT [2]. When BERT is chosen, the rest of the network architecture is the same as in the original paper: the sentence embeddings are fed into a BiLSTM encoder with attention to obtain a fixed-length document vector, which in turn is fed into a multi-layer perceptron with a softmax activation whose output size matches the number of classes in the chosen data set.
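The attention step that turns a variable number of sentence vectors into one fixed-length document vector can be sketched as follows. This is a minimal, dependency-free illustration of the attention pooling described by Yang et al., not the repository's PyTorch code: the function name `attention_pool`, the dimensions, and the randomly drawn parameters (`W`, `b`, `context`) are assumptions made for the example, whereas in a trained model they are learned.

```python
import math
import random


def attention_pool(vectors, W, b, context):
    """Collapse a list of equal-length vectors into a single vector.

    u_i     = tanh(W h_i + b)
    alpha_i = softmax_i(u_i . context)
    pooled  = sum_i alpha_i * h_i
    """
    scores = []
    for h in vectors:
        # Project each hidden vector into the attention space.
        u = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        # Score it against the context vector.
        scores.append(sum(u_j * c_j for u_j, c_j in zip(u, context)))

    # Numerically stable softmax over the sequence positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the original vectors gives the fixed-length output.
    pooled = [
        sum(a * h[k] for a, h in zip(weights, vectors))
        for k in range(len(vectors[0]))
    ]
    return pooled, weights


# Toy input: 5 "sentence embeddings" of size 4 (random stand-ins for encoder outputs).
random.seed(0)
dim, attn_dim, num_sentences = 4, 3, 5
sentence_vectors = [
    [random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_sentences)
]
W = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(attn_dim)]
b = [0.0] * attn_dim
context = [random.uniform(-1, 1) for _ in range(attn_dim)]

document_vector, sentence_weights = attention_pool(sentence_vectors, W, b, context)
```

The returned `sentence_weights` are the quantities that an attention visualization would display: one non-negative weight per sentence, summing to 1. The same pooling applies at the word level when a recurrent sentence encoder is used.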
Install pipenv with the following command:
$ pip install pipenv
Open pipenv environment in a new shell:
$ pipenv shell
Add the project to PYTHONPATH:
$ export PYTHONPATH=$PYTHONPATH:/path/to/han/src
Install dependencies:
$ pipenv sync
Download the document classification data sets from my Google Drive folder. Unpack them somewhere to create the following directory structure:
/path/to/data
├── ag_news_csv
│ ├── classes.txt
│ ├── readme.txt
│ ├── test.csv
│ ├── train.csv
├── yahoo_answers_csv
│ ├── classes.txt
│ ├── readme.txt
│ ├── test.csv
│ ├── train.csv
...
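Before preprocessing, it can help to confirm that the unpacked data matches the layout above. The following is a small sketch, assuming only the two data set folders and four file names shown in the tree; the helper name `missing_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED_FILES = ("classes.txt", "readme.txt", "test.csv", "train.csv")


def missing_files(data_root, datasets=("ag_news_csv", "yahoo_answers_csv")):
    """Return the expected files (relative to data_root) that are not present."""
    root = Path(data_root)
    return [
        (Path(dataset) / file_name).as_posix()
        for dataset in datasets
        for file_name in EXPECTED_FILES
        if not (root / dataset / file_name).is_file()
    ]


if __name__ == "__main__":
    # Replace with the folder where the archives were unpacked.
    for path in missing_files("/path/to/data"):
        print("missing:", path)
```

An empty result means every listed file was found; other data sets can be checked by passing their folder names through `datasets`.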
Every experiment has its own config file in experiments.
The pipeline for working with any model version or data set is:
python run.py preprocess experiment_config_file # Step 3a: preprocess the data
python run.py train experiment_config_file # Step 3b: train a model
python run.py infer experiment_config_file # Step 3c: evaluate the results
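The three steps can also be chained from a single script. The sketch below only wraps the `python run.py <step> <config>` commands shown above using the standard `subprocess` module; the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and the script is assumed to be started from the directory that contains `run.py`.

```python
import subprocess
import sys

# Step names as used by run.py in the commands above.
STEPS = ("preprocess", "train", "infer")


def pipeline_commands(config_path, python=sys.executable):
    """Build one command per pipeline step for the given experiment config."""
    return [[python, "run.py", step, config_path] for step in STEPS]


def run_pipeline(config_path):
    """Run preprocess, train and infer in order, stopping at the first failure."""
    for command in pipeline_commands(config_path):
        print("running:", " ".join(command))
        # check=True raises CalledProcessError if a step exits with a non-zero status.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("experiments/han-yahoo-glove-run.jsonnet")` would run all three steps for that experiment, so training never starts on data that failed to preprocess.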
Use the following experiment config files to reproduce results:
- AG News, BiLSTM (GloVe embeddings) version:
experiments/han-yahoo-glove-run.jsonnet.jsonnet
- AG News, BERT (base) version:
experiments/han-yahoo-bert-run.jsonnet.jsonnet
- Yahoo Answers, BiLSTM (GloVe embeddings) version:
experiments/han-yahoo-glove-run.jsonnet
One may add new configuration files for other data sets, or experiment with the hyperparameters of the existing configurations.
The infer step outputs a classification report against the test set of the chosen data set.
For example, on the AG News data set with the BiLSTM (GloVe embeddings) sentence encoder:
              precision  recall  f1-score  support
World              0.94    0.93      0.93     1900
Sports             0.98    0.99      0.98     1900
Business           0.89    0.91      0.90     1899
Sci/Tech           0.92    0.90      0.91     1900
accuracy                             0.93     7599
macro avg          0.93    0.93      0.93     7599
weighted avg       0.93    0.93      0.93     7599
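The summary rows of the report follow from the per-class rows: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The sketch below recomputes both from the values printed above. Because those values are already rounded to two decimals, the results only approximate the reported 0.93; the function names are illustrative.

```python
# (precision, recall, f1-score, support) per class, copied from the report above.
report = {
    "World": (0.94, 0.93, 0.93, 1900),
    "Sports": (0.98, 0.99, 0.98, 1900),
    "Business": (0.89, 0.91, 0.90, 1899),
    "Sci/Tech": (0.92, 0.90, 0.91, 1900),
}

PRECISION, RECALL, F1, SUPPORT = range(4)


def macro_average(rows, column):
    """Unweighted mean of one metric over all classes."""
    return sum(row[column] for row in rows.values()) / len(rows)


def weighted_average(rows, column):
    """Mean of one metric over all classes, weighted by class support."""
    total = sum(row[SUPPORT] for row in rows.values())
    return sum(row[column] * row[SUPPORT] for row in rows.values()) / total


total_support = sum(row[SUPPORT] for row in report.values())
macro_f1 = macro_average(report, F1)
weighted_f1 = weighted_average(report, F1)

print(f"support={total_support} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

Since the four classes have almost identical support, the macro and weighted averages nearly coincide, which is why both rows of the report show the same numbers.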
After running the infer command, one can visualize the sentence and word attention weights of each item in the test set using the notebook notebooks/Prediction Visualizer.ipynb.
Please note that one may need to change the value of PREDICTIONS_PATH when using this notebook.
For example, for the item at index 200, the 2nd sentence (out of 2) received the most attention, and within it the phrases broadband users and internet users had the highest word weights when determining the predicted class Sci/Tech.
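Reading off "the most attended sentence and words" amounts to an argmax over the sentence weights followed by a ranking of the word weights inside that sentence. The sketch below shows this on toy input. The data layout (parallel lists of tokens and weights), the function name `most_attended`, and all tokens and weights are invented for illustration; they do not reproduce the notebook's format or the actual weights of item 200.

```python
def most_attended(sentences, sentence_weights, word_weights, top_k=2):
    """Return the index of the highest-weighted sentence and its top_k words."""
    best = max(range(len(sentences)), key=lambda i: sentence_weights[i])
    ranked = sorted(
        zip(sentences[best], word_weights[best]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return best, [word for word, _ in ranked[:top_k]]


# Toy input: invented tokens and weights, for illustration only.
sentences = [
    ["stocks", "fell", "today"],
    ["broadband", "users", "grew"],
]
sentence_weights = [0.3, 0.7]
word_weights = [
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]

best_index, top_words = most_attended(sentences, sentence_weights, word_weights)
print(best_index, top_words)
```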
[1] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy, Hierarchical Attention Networks for Document Classification
@inproceedings{yang-etal-2016-hierarchical,
title = "Hierarchical Attention Networks for Document Classification",
author = "Yang, Zichao and
Yang, Diyi and
Dyer, Chris and
He, Xiaodong and
Smola, Alex and
Hovy, Eduard",
booktitle = "Proceedings of the 2016 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2016",
address = "San Diego, California",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/N16-1174",
doi = "10.18653/v1/N16-1174",
pages = "1480--1489",
}
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, {BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding
@inproceedings{devlin-etal-2019-bert,
title = "{BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding",
author = "Devlin, Jacob and
Chang, Ming-Wei and
Lee, Kenton and
Toutanova, Kristina",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/N19-1423",
doi = "10.18653/v1/N19-1423",
pages = "4171--4186",
abstract = "We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7{\%} (4.6{\%} absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).",
}