drem-attention

DREM model with attention mechanism


Overview

This is an implementation of the Dynamic Relation Embedding Model (DREM) with Attention networks for better user representation in personalized product search.

The DREM is a deep neural network model that jointly learns latent representations for queries, products, users, and knowledge entities. It is designed as a generative model, and the embedding representations for queries, users, and items are learned by optimizing the log likelihood of observed entity relationships. The probability (which is also the ranking score) of an item being purchased by a user with a query can be computed from their latent representations. Please refer to the paper for more details.
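
For intuition, below is a minimal NumPy sketch of how a ranking score of this form can be computed: the user representation is built by attending over the embeddings of previously purchased items (with a zero vector among the candidates, as in the zero attention idea), combined with the query embedding through a fixed weight, and scored against an item embedding by dot product. This is an illustration of the idea only, not the repository's implementation; the simple dot-product attention form and all names here are assumptions.

import numpy as np

def attentive_user_embedding(query_vec, history_item_vecs):
    # Attend over the user's purchase-history embeddings; a zero vector is added
    # to the candidates so the model can effectively attend to "nothing".
    candidates = np.vstack([history_item_vecs, np.zeros_like(query_vec)])
    scores = candidates @ query_vec          # simple dot-product attention (an assumption)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ candidates              # weighted sum = user representation

def rank_score(query_vec, user_vec, item_vec, query_weight=0.5):
    # Joint query-user representation, scored against the item by dot product.
    joint = query_weight * query_vec + (1.0 - query_weight) * user_vec
    return float(joint @ item_vec)

# Toy example with random 100-dimensional embeddings and 20 history items.
rng = np.random.default_rng(0)
query = rng.normal(size=100)
history = rng.normal(size=(20, 100))
item = rng.normal(size=100)
user = attentive_user_embedding(query, history)
print(rank_score(query, user, item))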

Requirements:

1. To run the DREM-Attention model in ./ProductSearch/ and the Python scripts in ./utils/, Python 3.0+ and TensorFlow v1.3+ are needed (in the paper, we used Python 3.6 and TensorFlow v1.4.0). A quick environment check is shown after this list.
2. To run the jar package in ./utils/AmazonDataset/jar/, JDK 1.7 is needed.
3. To compile the Java code in ./utils/AmazonDataset/java/, Galago from the Lemur project (https://sourceforge.net/p/lemur/wiki/Galago%20Installation/) is needed.
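
As an optional sanity check (not part of the repository), you can confirm the Python and TensorFlow versions in your environment with:

import sys
import tensorflow as tf

print(sys.version)       # expect Python 3.x
print(tf.__version__)    # expect 1.3 or later (the paper used 1.4.0)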

Install

Create virtual environment (optional):

pip install --user virtualenv
~/.local/bin/virtualenv -p python3 ./venv
source venv/bin/activate

Install DREM-Attention from the source:

git clone https://github.com/utahIRlab/drem-attention.git
cd drem-attention
python setup.py install #use setup-gpu.py for GPU support

Run example:

cd example/AmazonDataset/
bash exp_pipeline.sh

Data Preparation

1. Download the Amazon review datasets from http://jmcauley.ucsd.edu/data/amazon/ (e.g., in our paper, we used the 5-core data). A complete example command sequence appears after this list.
2. Stem and remove stop words from the Amazon review datasets if needed (e.g., in our paper, we stemmed the “reviewText” and “summary” fields without stop word removal)
    1. java -Xmx4g -jar ./utils/AmazonDataset/jar/AmazonReviewData_preprocess.jar <jsonConfigFile> <review_file> <output_review_file>
        1. <jsonConfigFile>: a JSON file that specifies the path of the stop word list. An example can be found in the root directory. Enter “false” if you don’t want to remove stop words.
        2. <review_file>: the path of the original Amazon review data
        3. <output_review_file>: the output path for the processed Amazon review data
3. Index datasets
    1. python ./utils/AmazonDataset/index_and_filter_review_file.py <review_file> <indexed_data_dir> <min_count>
        1. <review_file>: the file path for the Amazon review data
        2. <indexed_data_dir>: output directory for indexed data
        3. <min_count>: the minimum count for terms. If a term appears fewer than <min_count> times in the data, it will be ignored.
4. Extract queries and split train/test
    1. Download the meta data from http://jmcauley.ucsd.edu/data/amazon/ 
    2. Match the meta data with the indexed data:
        1. java -Xmx16G -jar ./utils/AmazonDataset/jar/AmazonMetaData_matching.jar <jsonConfigFile> <meta_data_file> <indexed_data_dir>
            1. <jsonConfigFile>: a JSON file that specifies the path of the stop word list. An example can be found in the root directory. Enter “false” if you don’t want to remove stop words.
            2. <meta_data_file>: the path of the meta data
            3. <indexed_data_dir>: the directory for indexed data
    3. Collect time sequence information from users' purchase histories:
        1. python3 ./utils/AmazonDataset/collect_time_seq_info.py <indexed_data_dir> <jsonConfigFile>
            1. <indexed_data_dir>: the directory for indexed data
            2. <jsonConfigFile>: a JSON file that specifies the path of the stop word list. An example can be found in the root directory.
    4. Split the dataset for training and test
        1. python ./utils/AmazonDataset/random_split_train_test_data.py <indexed_data_dir> <review_sample_rate> <query_sample_rate>
            1. <indexed_data_dir>: the directory for indexed data
            2. <review_sample_rate>: the proportion of reviews used for testing for each user (e.g., in our paper, we used 0.3)
            3. <query_sample_rate>: the proportion of queries used for testing (e.g., in our paper, we used 0.3)
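
Putting these steps together, a hypothetical end-to-end run might look like the following. The file names (the JSON config file, the review and meta-data files, the output directory) are illustrative assumptions, and the numeric values simply reuse the settings mentioned above (min_count of 5, sample rates of 0.3):

 java -Xmx4g -jar ./utils/AmazonDataset/jar/AmazonReviewData_preprocess.jar ./config.json reviews_Movies_and_TV_5.json.gz processed_reviews.json.gz
 python ./utils/AmazonDataset/index_and_filter_review_file.py processed_reviews.json.gz ./indexed_data/ 5
 java -Xmx16G -jar ./utils/AmazonDataset/jar/AmazonMetaData_matching.jar ./config.json meta_Movies_and_TV.json.gz ./indexed_data/
 python3 ./utils/AmazonDataset/collect_time_seq_info.py ./indexed_data/ ./config.json
 python ./utils/AmazonDataset/random_split_train_test_data.py ./indexed_data/ 0.3 0.3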

Model Training/Testing

1. python ./ProductSearch/main.py --<parameter_name> <parameter_value> --<parameter_name> <parameter_value> … 
    1. learning_rate:  The learning rate in training. Default 0.5.
    2. learning_rate_decay_factor: Learning rate decays by this factor whenever the current loss is higher than the previous three losses. Default 0.90.
    3. max_gradient_norm: Clip gradients to this norm. Default 5.0
    4. subsampling_rate: The rate used to subsample frequent words. Default 1e-4.
    5. L2_lambda: The lambda for L2 regularization. Default 0.0
    6. query_weight: The weight for queries in the joint model of queries and users. Default 0.5
    7. batch_size: Batch size used in training. Default 64
    8. data_dir: Data directory, which should be the <indexed_data_dir>
    9. input_train_dir: The directory of training and testing data, which usually is <data_dir>/query_split/
    10. train_dir: Model directory & output directory
    11. similarity_func: The function to compute the ranking score for an item with the joint model of query and user embeddings. Default “product”.
        1. “product”: the dot product of two vectors.
        2. “cosine”: the cosine similarity of two vectors.
        3. “bias_product”: the dot product plus an item-specific bias (a small sketch of these scoring functions appears at the end of this section).
    12. net_struct: Network structure parameters. Different parameters are separated by “_” (e.g., “simplified_mean” combines the “simplified” and “mean” options below). Default “simplified_fs”.
        1. “ZAM”: the zero attention model proposed by Ai et al. [3]
        2. “LSE”: the latent space entity model proposed by Gysel et al. [1]
        3. “simplified”: simplified embedding-based language models without modeling for each review [2]
        4. “mean”: average word embeddings for query embeddings [5]
        5. “fs”: average word embeddings with non-linear projection for query embeddings [1]
        6. “RNN”: recurrent neural network encoder for query embeddings
    13. embed_size: Size of each embedding. Default 100.
    14. window_size: Size of context window for hdc model. Default 5.
    15. max_train_epoch: Limit on the epochs of training (0: no limit). Default 5.
    16. steps_per_checkpoint: How many training steps to do per checkpoint. Default 200
    17. seconds_per_checkpoint: How many seconds to wait before storing embeddings. Default 3600
    18. negative_sample: How many samples to generate for negative sampling. Default 5.
    19. decode: Set to “False” for training and “True” for testing. Default “False”.
    20. test_mode: Test modes. Default “product_scores”.
        1. “product_scores”: output ranking results and ranking scores.
        2. “output_embedding”: output embedding representations for users, items, and words.
        3. “explain”: start the interactive explanation mode. Specify a product, user, and query id to find the nearest neighbors of each entity in the different entity spaces. Read interactive_explain_mode() in ./ProductSearch/main.py for more information.
        4. “explanation_path”: generate explanation paths for all user-query-product pairs in the test batch.
    21. rank_cutoff: Rank cutoff for output rank lists. Default 100.
    22. explanation_output_dir: Output directory for generated explanations. Provide only when test_mode is 'explanation_path'.
    23. max_history_length: Max number of products from the user's purchase history used by the model. Default 20.
2. Evaluation
    1. After training with “--decode False”, generate test rank lists with “--decode True”.
    2. TREC-format rank lists for the test data will be stored in <train_dir> with the name “test.<similarity_func>.ranklist”.
    3. Evaluate the test rank lists against the ground truth <input_train_dir>/test.qrels using trec_eval or the galago eval tool.
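
For reference, here is a minimal NumPy sketch of the three similarity_func options described above. The function and argument names are illustrative; in the repository these scores are computed inside the TensorFlow graph.

import numpy as np

def ranking_score(joint_vec, item_vec, item_bias=0.0, similarity_func="product"):
    # Score an item embedding against the joint query-user vector.
    if similarity_func == "product":        # dot product of the two vectors
        return float(joint_vec @ item_vec)
    if similarity_func == "cosine":         # cosine similarity of the two vectors
        return float((joint_vec @ item_vec) /
                     (np.linalg.norm(joint_vec) * np.linalg.norm(item_vec)))
    if similarity_func == "bias_product":   # dot product plus an item-specific bias
        return float(joint_vec @ item_vec + item_bias)
    raise ValueError("unknown similarity_func: %s" % similarity_func)

Once the rank lists are generated, evaluation is a standard run-versus-qrels comparison; for example, with trec_eval on your PATH (exact measures printed depend on your trec_eval version):

 trec_eval <input_train_dir>/test.qrels <train_dir>/test.bias_product.ranklist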

Example train & test scripts

Train model

 python ./ProductSearch/main.py --data_dir=<data-dir> --input_train_dir=<input-train-dir> --min_count 5 
 --learning_rate 0.5 --max_train_epoch 20 --embed_size 100 --subsampling_rate 1e-4 --L2_lambda 0.005 
 --batch_size 64 --window_size 3 --negative_sample 5 --rank_cutoff 100 --similarity_func 'bias_product' 
 --query_weight 0.5 --train_dir <train-dir>

Test & generate ranklist

 python ./ProductSearch/main.py --data_dir=<data-dir> --input_train_dir=<input-train-dir> --min_count 5 
 --learning_rate 0.5 --max_train_epoch 20 --embed_size 100 --subsampling_rate 1e-4 --L2_lambda 0.005 
 --batch_size 64 --window_size 3 --negative_sample 5 --rank_cutoff 100 --similarity_func 'bias_product' 
 --query_weight 0.5 --train_dir <train-dir> --decode True

Test & generate explanations

 python ./ProductSearch/main.py --data_dir=<data-dir> --input_train_dir=<input-train-dir> --min_count 5 
 --learning_rate 0.5 --max_train_epoch 20 --embed_size 100 --subsampling_rate 1e-4 --L2_lambda 0.005 
 --batch_size 64 --window_size 3 --negative_sample 5 --rank_cutoff 100 --similarity_func 'bias_product' 
 --query_weight 0.5 --train_dir <train-dir> --decode True --test_mode explanation_path --explanation_output_dir <explanation-output-dir>

Generating data for Mturk survey

Run the model with test_mode set to “explanation_path” to obtain explanations for all user-query-product pairs in the dataset. The generated explanations can be found in the output CSV file at the path you provided when running the model. This CSV file contains the following fields: user, query, product, explanation, previous_reviews.

Likewise, run the original DREM model with test_mode “explanation_path” to obtain its explanations for the same user-query-product pairs. Then merge the explanations from the two CSV files into a single CSV file for MTurk.

Merge explanations:

python utils/merge_explanations.py <your drem csv file name> <your drem attention csv file name> <output file>
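
Conceptually, the merge is a join of the two CSV files on the shared user/query/product columns. The following pandas sketch shows that kind of join; the column and file names are assumptions based on the field lists above, not taken from merge_explanations.py.

import pandas as pd

# Hypothetical file names; substitute your own CSV paths.
drem = pd.read_csv("drem_explanations.csv")
drem_attn = pd.read_csv("drem_attention_explanations.csv")

# Join on the shared identifier columns and keep both explanation variants.
merged = drem.merge(
    drem_attn[["user", "query", "product", "explanation"]],
    on=["user", "query", "product"],
    suffixes=("_drem", "_drem_attn"),
)
merged = merged.rename(columns={"explanation_drem": "drem_explanation",
                                "explanation_drem_attn": "drem_attn_explanation"})
merged.to_csv("merged_explanations.csv", index=False)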

This merged output CSV file should be provided as input to web_scrapper.py to scrape the product title, image, and description. The web scraper generates a CSV file with the following fields: sample_id, user, query, product, drem_explanation, drem_attn_explanation, previous_reviews, title, image, description. This output file is present at utils/mturk-batch-input.csv.

Scraper usage:

python utils/web_scapper.py <your input file name>
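
For illustration only (this is not the repository's scraper), a lookup of this kind might fetch a product page and pull the title with requests and BeautifulSoup. It assumes the product column holds Amazon ASINs, that pages are reachable at the /dp/<asin> URL pattern, and that the title sits in a #productTitle element; all of these are assumptions, and Amazon frequently blocks automated requests.

import requests
from bs4 import BeautifulSoup

def fetch_product_title(asin):
    # Hypothetical sketch: URL pattern and CSS selector are assumptions about
    # Amazon's page markup, which changes over time.
    url = "https://www.amazon.com/dp/%s" % asin
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.select_one("#productTitle")
    return title.get_text(strip=True) if title else ""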

The generated mturk-batch-input.csv file is to be uploaded to Amazon MTurk as a batch input file. The setup for the MTurk survey is described in this documentation, and the MTurk UI design for the survey is available here.

Citation

If you use this code or data in your research, please cite the following BibTeX entry.

@misc{ai2021modelagnostic,
      title={Model-agnostic vs. Model-intrinsic Interpretability for Explainable Product Search}, 
      author={Qingyao Ai and Lakshmi Narayanan Ramasamy},
      year={2021},
      eprint={2108.05317},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}