This is the repository for our EMNLP 2021 Findings publication: Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction. Procedures implemented in this repo are Aspect-Sentiment Pair Extractor (ASPE) and Attention-Property-aware Rating Estimator (APRE). This document introduces how to reproduce the experiments of ASPE + APRE.
Please cite our paper using the following BibTeX entry:
@inproceedings{aspeapre,
author = {Zeyu Li and
Wei Cheng and
Reema Kshetramade and
John Houser and
Haifeng Chen and
Wei Wang},
title = {Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing: Findings, {EMNLP} 2021, 7-11 November 2021,
Online and in the Barceló Bávaro Convention Centre, Punta Cana, Dominican Republic},
series = {Findings of {ACL}},
volume = {{EMNLP} 2021},
publisher = {Association for Computational Linguistics},
year = {2021},
}
We use the following datasets: Amazon. Two external resources are also required: GloVe pretrained word vectors and BERT pretrained parameters.
We used the 5-core version. The downloaded files have the .json.gz extension. Decompressing one yields a JSON file (e.g., Office_Products_5.json.gz becomes Office_Products_5.json). Rename it to office_products.json and move it to raw_datasets/amazon/office_products.json, because the preprocessing pipeline locates the files to process by name. In this case, pass office_products to the --amazon_subset flag.
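The decompress, rename, and move steps can also be scripted. The following is a minimal sketch, not part of the repository: the function name prepare_amazon_subset and its parameters are hypothetical, and it assumes the archive has already been downloaded.

```python
import gzip
import shutil
from pathlib import Path


def prepare_amazon_subset(archive, subset, root="raw_datasets/amazon"):
    """Decompress a downloaded .json.gz archive into <root>/<subset>.json.

    The helper name and signature are illustrative assumptions; only the
    target layout (raw_datasets/amazon/<subset>.json) comes from the README.
    """
    target = Path(root) / f"{subset}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the decompressed bytes so large review dumps never sit fully in memory.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For the example above, calling prepare_amazon_subset("Office_Products_5.json.gz", "office_products") would write raw_datasets/amazon/office_products.json, which matches the name expected by --amazon_subset.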
GloVe is a set of pre-trained word embeddings widely used across NLP tasks.
GloVe can be downloaded from here.
After downloading, place it in ./glove and then run
python src/reformat_glove.py
Please run this single command to download and install all Python packages, NLTK packages, and spaCy dependencies.
bash scripts/get_prereq_ready.sh
Many lines will be printed. If no errors appear, go ahead to Data Preprocessing. If you are interested in what has been installed into your Python environment, read the rest of this section.
Ruara is implemented in Python and PyTorch. To install all Python dependencies in one step, run
pip install -r requirements.txt
- nltk:
  - install nltk with pip.
  - download the nltk supporting corpora.
    >>> import nltk
    >>> nltk.download('punkt')
    >>> nltk.download('averaged_perceptron_tagger')
    >>> nltk.download('words')
- spaCy:
  - install the POS tagging package in shell
    python3 -m spacy download en_core_web_sm
To preprocess the data, run preprocess.py. Use the following command to see the help information.
python src/preprocess.py -h
Detailed instructions for each flag:
usage: preprocess.py [-h] [--test_split_ratio TEST_SPLIT_RATIO]
[--k_core K_CORE] [--min_review_len MIN_REVIEW_LEN]
[--use_spell_check] [--amazon_subset AMAZON_SUBSET]
optional arguments:
-h, --help show this help message and exit
--test_split_ratio TEST_SPLIT_RATIO
Ratio of test split to main dataset. Default=0.1.
--k_core K_CORE The number of cores of the dataset. Default=5.
--min_review_len MIN_REVIEW_LEN
Minimum num of words of the reviews. Default=5.
--use_spell_check Whether to use spell check and correction. Turning
this on will SLOW DOWN the process.
--amazon_subset AMAZON_SUBSET
[Amazon-only] Subset name of Amazon datasett
Here's an example for parsing the Digital Music dataset for Amazon.
python src/preprocess.py --amazon_subset=digital_music
Note: Some words in the reviews are misspelled, which can distort the PMI statistics for aspect words. The --use_spell_check flag corrects them at the cost of a slower run.
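To illustrate what the --k_core setting refers to, here is a minimal sketch of the standard k-core filter. It is an assumption about the general technique, not a copy of what preprocess.py does: the function name k_core_filter and the (user_id, item_id) data layout are hypothetical.

```python
from collections import Counter


def k_core_filter(reviews, k=5):
    """Drop reviews until every remaining user and item has at least k reviews.

    `reviews` is a list of (user_id, item_id) pairs; this layout is an
    assumption made for illustration only.
    """
    kept = list(reviews)
    while True:
        user_counts = Counter(user for user, _ in kept)
        item_counts = Counter(item for _, item in kept)
        filtered = [(user, item) for user, item in kept
                    if user_counts[user] >= k and item_counts[item] >= k]
        # Removing one review can push another user or item below k,
        # so repeat until nothing changes.
        if len(filtered) == len(kept):
            return filtered
        kept = filtered
```

The loop is needed because a single pass is not enough: dropping a rare item may leave one of its reviewers with fewer than k reviews.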
After preprocessing, we arrive at the ASPE part. It consists of the following steps:
- Annotate text with the NN-based model (SDRN in our example).
- Prepare the sentiment lexicon.
- Use annotate.py to build PMI and merge the three sets to build $ST$.
- Use extract.py to build AS-candidates and eventually AS-pairs.
Please see instructions for SDRN. There's a link that can teleport you back here.
The lexicon is included in this package; see ./configs/opinion-lexicon-English. No action is needed for this step.
We run annotate.py to build PMI and merge the three sets to build $ST$. annotate.py is in charge of the following tasks:
- Compute P(w_i, w_j) and P(w_i).
- Compute PMI for each word in the corpus.
- Load sentiment terms extracted by the NN-based and Lexicon-based methods.
- Use dependency parsing technique to find aspect-sentiment pair candidates. (AS-cand)
- Save extracted AS-cand.
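The first two tasks in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in annotate.py: the function name compute_pmi, the way co-occurrences are counted inside the window, and the normalization of P(w_i, w_j) by the total number of co-occurring pairs are all choices made for this example.

```python
import math
from collections import Counter


def compute_pmi(sentences, window_size=5, min_count=1):
    """Return {(w_i, w_j): PMI} for word pairs co-occurring within a window.

    PMI(w_i, w_j) = log(P(w_i, w_j) / (P(w_i) * P(w_j))). The counting and
    normalization scheme here is an assumption, not the repository's.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for pos, left in enumerate(tokens):
            # Pair each token with the tokens that follow it inside a window
            # of `window_size` tokens starting at `left`.
            for right in tokens[pos + 1:pos + window_size]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (w_i, w_j), count in pair_counts.items():
        if word_counts[w_i] < min_count or word_counts[w_j] < min_count:
            continue  # rare tokens are discarded, in the spirit of --token_min_count
        p_ij = count / total_pairs
        p_i = word_counts[w_i] / total_words
        p_j = word_counts[w_j] / total_words
        pmi[(w_i, w_j)] = math.log(p_ij / (p_i * p_j))
    return pmi
```

The window_size and min_count parameters play the roles of --pmi_window_size and --token_min_count described in the help text below.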
[HELP] Detailed instructions are below.
usage: annotate.py [-h] --path PATH --sdrn_anno_path SDRN_ANNO_PATH
[--pmi_window_size PMI_WINDOW_SIZE]
[--token_min_count TOKEN_MIN_COUNT]
[--num_senti_terms_per_pol NUM_SENTI_TERMS_PER_POL]
[--use_senti_word_list] [--glove_dimension GLOVE_DIMENSION]
[--multi_proc_dep_parsing]
[--num_workers_mp_dep NUM_WORKERS_MP_DEP]
[--do_compute_pmi]
optional arguments:
-h, --help show this help message and exit
--path PATH Path to the dataset.
--sdrn_anno_path SDRN_ANNO_PATH
Path to SDRN annotation results in `.txt`
--pmi_window_size PMI_WINDOW_SIZE
The window size of PMI cooccurance relations.
Default=5.
--token_min_count TOKEN_MIN_COUNT
Minimum token occurences in corpus. Rare tokens are
discarded. Default=20.
--num_senti_terms_per_pol NUM_SENTI_TERMS_PER_POL
Number of sentiment terms per seed. Default=300.
--use_senti_word_list
If used, sentiment word table will be used as well.
--glove_dimension GLOVE_DIMENSION
The dimension of glove to use in the PMI parsing.
Default=100.
--multi_proc_dep_parsing
If used, parallel processing of dependency parsing
will be enabled.
--num_workers_mp_dep NUM_WORKERS_MP_DEP
Number of workers to be spinned off for multiproc dep
parsing.
--do_compute_pmi Whether to redo pmi computation
[EXAMPLE] Here's an example for parsing the Digital Music dataset for Amazon.
bash scripts/run_annotate.sh digital_music
Pay attention to the --do_compute_pmi flag. Enable it on the first run so that the PMI computation is executed. The PMI terms are saved after the first run (see the output files listed next), so later runs can skip the computation.
[OUTPUT] The most important results annotate.py generates:
- train_data_dep.pkl: the pickle file of a dataframe with a column containing spacy.doc objects.
- pmi_senti_terms.pkl: sentiment terms extracted by the PMI method.
We use extract.py to filter useful aspects and convert aspects to indices.
[HELP] Below is the help information
usage: extract.py [-h] --data_path DATA_PATH --count_threshold COUNT_THRESHOLD
[--run_mapping]
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
Path to the dataset.
--count_threshold COUNT_THRESHOLD
Threshold of the count.
--run_mapping If off, only get aspairs but do not work on df. For
viewing use, cheaper.
Please check the count_threshold of each dataset in the paper. --run_mapping is a flag that turns the heavy work on or off. If used, the actual filtering is performed. Otherwise, the script only applies count thresholding to remove the infrequent aspects.
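Count thresholding and the aspect-to-index conversion can be sketched as below. This is an illustrative assumption about the technique, not the logic of extract.py: the function name threshold_aspects, the inclusive comparison, and the alphabetical index order are hypothetical choices.

```python
from collections import Counter


def threshold_aspects(aspect_mentions, count_threshold):
    """Keep aspects mentioned at least `count_threshold` times and index them.

    Whether the real threshold is inclusive, and how indices are ordered,
    are assumptions made for this sketch.
    """
    counter = Counter(aspect_mentions)
    kept = sorted(aspect for aspect, count in counter.items()
                  if count >= count_threshold)
    aspect2idx = {aspect: idx for idx, aspect in enumerate(kept)}
    idx2aspect = {idx: aspect for aspect, idx in aspect2idx.items()}
    return counter, aspect2idx, idx2aspect
```

The three return values loosely correspond to the counter and the two lookup tables that the output pickles store.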
[EXAMPLE] An example of building AS-pairs with extract.py is below.
bash scripts/run_extract.sh
Note that this script processes all seven datasets. If you do not want all of them, comment out the unwanted ones.
[OUTPUT] The most important results extract.py generates:
- train_data_aspairs.pkl: all information needed for training.
- aspcat_counter.pkl, aspcat2idx.pkl, idx2aspcat.pkl, and asp2aspcat.pkl: pickles that store the aspect-to-ID and ID-to-aspect mappings. (Implementation related only)
We are almost there! To speed up training, we tokenize the text beforehand using postprocess.py. Some work, especially locating the sentiment terms, can be done ahead of time to save time during training.
[HELP] Please find the useful information here.
usage: postprocess.py [-h] --data_path DATA_PATH --num_aspects NUM_ASPECTS
[--max_pad_length MAX_PAD_LENGTH]
[--num_workers NUM_WORKERS] [--build_nbr_graph]
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
Path to the dataset.
--num_aspects NUM_ASPECTS
Number of aspect categories in total
--max_pad_length MAX_PAD_LENGTH
Max length of padding. Default=100.
--num_workers NUM_WORKERS
Number of multithread workers
--n_partition N_PARTITION
Number of partitions for multiprocessing.
--build_nbr_graph Whether to build neighborhood graph.
The number of aspects is printed by extract.py.
[Example] An example for digital music is
python src/postprocess.py --data_path=./data/amazon/digital_music --num_aspects 296
Or in a bash file:
bash scripts/run_postprocess.sh
Again, if you do not want all datasets, comment out the unwanted ones. Another tip for the execution: for larger datasets, the only way to make it runnable is to set both --num_workers and --n_partition to 1.
[OUTPUT] The most important results postprocess.py generates:
- user_anno_tkn_revs.pkl and item_anno_tkn_revs.pkl: pickle files containing tokenized IDs and attention masks for the BERT model. For details, check out the EntityReviewAggregation class.
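To make the relationship between tokenized IDs, attention masks, and --max_pad_length concrete, here is a minimal sketch. It is not the EntityReviewAggregation implementation: the function name pad_token_ids is hypothetical, and pad_id=0 is an assumption about the tokenizer's padding ID.

```python
def pad_token_ids(token_ids, max_pad_length=100, pad_id=0):
    """Truncate or right-pad `token_ids` and build the matching attention mask.

    pad_id=0 is assumed here; check the tokenizer's vocabulary for the
    actual padding ID.
    """
    ids = list(token_ids[:max_pad_length])
    mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_pad_length - len(ids)
    # 0 in the mask marks padded positions the model should ignore.
    return ids + [pad_id] * padding, mask + [0] * padding
```

Every review then has the same length, so the sequences can be stacked into fixed-size tensors before training starts.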
All preparation steps are done! Let's get to the training and testing part.
Everything related to training is in the train.py file. Run
python src/train.py -h
to check the experiment configurations. For most arguments, the short descriptions in the help field are self-explanatory. A few arguments deserve a mention:
- --task: choose from train and both. train only trains the model and saves it if the save-model flag is on. both trains and tests the model according to the evaluation config.
- --experimentID: a unique string for this experiment. You can locate an experiment run by its experiment ID. For example, the log of this run will be stored as ./log/[experimentID].log in the logging directory ./log/.
- --eval_after_epoch_num: you may not want to evaluate (test) during the first few epochs because testing takes time and the model may not be ready yet. With this argument, the model starts testing only after the given number of epochs.
- --disable_explicit and --disable_implicit: use these two arguments to run the ablation studies in our paper. Turn on --disable_explicit to get w/o EX and turn on --disable_implicit to get w/o IM.
Some parameters should be aligned with the ASPE part.
- --padded_length: we set it to 100 (default).
- --num_aspects, --num_user, and --num_item: please check the paper. It's okay to set --num_user and --num_item to larger values to avoid out-of-bound errors.
We provide an example that trains and tests on the Digital Music dataset in scripts/run_train.sh.
You will see the training progress printed in the console when you run train.py, but it scrolls away quickly. That's where the log comes in. You can find the log of a given experiment run in the directory ./log under the name [experimentID].log. For example, in scripts/run_train.sh the experiment ID is set to "001", so you will find 001.log in the log directory. Below is a short segment of the log:
...
[01/19/2021 09:05:27][INFO][train.py] [Perf][Iter] ep:[8] iter:[400/460] loss:[0.4566] [2928,1835]
[01/19/2021 09:05:32][INFO][train.py] [Perf][Iter] ep:[8] iter:[420/460] loss:[0.6840] [2928,1835]
[01/19/2021 09:05:38][INFO][train.py] [Perf][Iter] ep:[8] iter:[440/460] loss:[0.9334] [2928,1835]
[01/19/2021 09:05:43][INFO][train.py] [Perf][Epoch] ep:[8] iter:[4140] avgloss:[0.648911]
[01/19/2021 09:05:52][INFO][train.py] [test] ep:[8] mse:[(0.8755816, 0.84472084)]
[01/19/2021 09:05:52][INFO][train.py] [Time] Starting Epoch 9
[01/19/2021 09:05:53][INFO][train.py] [Perf][Iter] ep:[9] iter:[0/460] loss:[0.9120] [2928,1835]
[01/19/2021 09:05:58][INFO][train.py] [Perf][Iter] ep:[9] iter:[20/460] loss:[0.5846] [2928,1835]
[01/19/2021 09:06:04][INFO][train.py] [Perf][Iter] ep:[9] iter:[40/460] loss:[0.7867] [2928,1835]
...
Details:
- Lines with [Iter] report training status per iteration. We set --log_iter_num to 20, so consecutive print-outs are 20 iterations apart.
- Lines with [Epoch] report the status for the whole epoch, including the average loss.
- Lines with [test] report testing performance, and they are what we reported. To see only the test performance extracted from the full log, run python src/parse_log.py [experimentID]. The two numbers after mse are the unclamped loss and the clamped loss, respectively.
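For readers who want to post-process the log themselves, the [test] lines can be pulled out with a regular expression. The sketch below is not the repository's parse_log.py; the pattern is inferred from the log segment shown above, and the name parse_test_lines is hypothetical.

```python
import re

# Matches lines such as "... [test] ep:[8] mse:[(0.8755816, 0.84472084)]".
# The pattern is inferred from the sample log, not taken from train.py.
TEST_LINE = re.compile(
    r"\[test\] ep:\[(\d+)\] mse:\[\(([\d.eE+-]+), ([\d.eE+-]+)\)\]"
)


def parse_test_lines(lines):
    """Return (epoch, unclamped_mse, clamped_mse) for every [test] line."""
    results = []
    for line in lines:
        match = TEST_LINE.search(line)
        if match:
            epoch, unclamped, clamped = match.groups()
            results.append((int(epoch), float(unclamped), float(clamped)))
    return results
```

Passing the lines of ./log/001.log to parse_test_lines would give one tuple per evaluated epoch, which is convenient for picking the best epoch or plotting the curve.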
If --save_model is on and --save_epoch_num and --save_after_epoch_num are properly configured, you'll find the checkpoints in the ./ckpt/ directory (or the path you specify in --save_model_path). As these are only the weights of the model, you can restore them by
import torch
from model import APRE
# load args
model = APRE(args)
model.load_state_dict(torch.load(args.load_model_path))
We use a separate section to describe SDRN, a BERT-based model for aspect and sentiment co-extraction. We carefully recorded the procedure for reproduction. NOTE: If trained models are already available, please jump to step 8!
1. Clone the repo from GitHub: https://github.com/NKU-IIPLab/SDRN. Many thanks for sharing the code! Put it at [this repo]/extractors/SDRN.
2. Install PyTorch 0.4.1.
3. Install the package pytorch_pretrained_bert. (It might be outdated, but the SDRN implementation was actually based on it.)
4. Download the BERT checkpoint and config files from here. Note that the .bin (checkpoint) and the .json (config) have to match! Add their locations to main.py.
5. modeling.py didn't come with the original SDRN repo. Please find it from here.
6. Make the following changes:
   - Change the imports in main.py and opinionMining.py, all of which are related to from bert.modeling. Change
     # main.py
     from bert.modeling import BertConfig
     from bert.optimization import BERTAdam
     to
     # main.py
     from pytorch_pretrained_bert.modeling import BertConfig
     from pytorch_pretrained_bert.optimization import BertAdam as BERTAdam
     And change
     # opinionMining.py
     from bert.modeling import BertModel, BERTLayerNorm
     to
     # opinionMining.py
     from modeling import BertModel, BERTLayerNorm
     assuming that modeling.py has been put in the right position.
   - There was a bug in _load_from_state_dict in the repo, which can be fixed easily.
   - Another problem that I encountered was the .gamma and .beta of BertLayerNorm. I later fixed it by finding the original modeling.py.
7. Train the model with the given datasets: 2014Lap.pt, 2014Res.pt, 2015Res.pt, using the following script:
   # run one corpus at a time
   [in SDRN dir]$ bash scripts/train_sdrn.sh [dataset] [No. of epochs]
   # run everything
   [in SDRN dir]$ bash scripts/train_all.sh
   # e.g.
   [in SDRN dir]$ bash scripts/train_sdrn.sh 2014Res 5
   Below are the numbers of epochs I used to train SDRN.
   Name    2014Lap  2014Res  2015Res
   #. Ep   5        10       8
8. [If trained SDRN models are available, start right from this step!] Massage our data into an SDRN-compatible format and run inference (annotation). We wrote a Python script to do the work using the preprocessed Amazon data. Note that it takes a long time to run.
   [in SDRN dir]$ bash scripts/run_inference.sh
   Detailed parameters within run_inference.sh:
   [in SDRN dir]$ python ruara_evaluate.py [do_process] [training data] [annotate subset] [head] [gpu_id]
   The semantics of the parameters:
   - do_process: True/False, whether to rerun formatting Amazon data to SDRN data.
   - training data: the dataset that trains the SDRN model.
   - annotate subset: the Amazon subset to process.
   - head: a positive number gives the number of top lines of Amazon data to process; a negative number processes the whole dataset.
   - gpu_id: the GPU to use.
9. Parse the output annotation file. Please run the following command to merge sentiments extracted from the three SDRN versions.
   # Change the dataset names in this file correspondingly!
   [in SDRN dir]$ bash scripts/parse_output.sh
   Details:
   [in SDRN dir]$ python parse_output.py [task] [training set] [annotate subset]
   The definitions of the parameters are the same as those in point 8, except task, which can be parse or merge.
   For parse, the output will be
   - ./data/anno_[train_set]_[subset]/aspect_terms.pkl: the aspect terms list pickle.
   - ./data/anno_[train_set]_[subset]/sentiment_terms.pkl: the sentiment terms list pickle.
   For example, ./data/anno_2014Lap_digital_music/aspect_terms.pkl saves all aspect terms extracted by a 2014Lap-trained SDRN model for the dataset digital_music.
   For merge, take the output from the above step and merge the sentiment term sets. The results are written to ./data/senti_term_[subset]_merged.pkl.
10. At this point, the SDRN term extraction is done. The generated file can be picked up by annotate.py. Click here to jump back.
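As an illustration of the merge task of parse_output.py described above, the sketch below unions the sentiment terms from several pickles. It is an assumption about the general idea, not the script itself: the function name merge_sentiment_terms is hypothetical, and each pickle is assumed to hold an iterable of term strings.

```python
import pickle
from pathlib import Path


def merge_sentiment_terms(pickle_paths, output_path):
    """Union the sentiment terms stored in several pickles and save the result.

    Assumes each input pickle contains an iterable of strings; the real
    file layout produced by parse_output.py may differ.
    """
    merged = set()
    for path in pickle_paths:
        with open(path, "rb") as handle:
            merged.update(pickle.load(handle))
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as handle:
        pickle.dump(merged, handle)
    return merged
```

For the digital_music subset, the inputs would be the three sentiment_terms.pkl files under ./data/anno_[train_set]_digital_music/, one per SDRN training set, and the output would be ./data/senti_term_digital_music_merged.pkl.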