Overview

This is an implementation of the Joint Representation Learning (JRL) model for product recommendation based on heterogeneous information sources [2]. Please cite the following paper if you plan to use it for your project:

    Yongfeng Zhang, Qingyao Ai, Xu Chen, W. Bruce Croft. 2017. "Joint Representation Learning for Top-N Recommendation with Heterogeneous Information Sources". In Proceedings of CIKM '17.

JRL is a deep neural network model that jointly learns latent representations for products and users based on reviews, images and product ratings. The model can learn these representations jointly or independently from each information source. The probability (which is also the rank score) of a product being purchased by a user can be computed with their concatenated latent representations from the different information sources. Please refer to the paper for more details.

Requirements

o To run the JRL model in ./JRL/ and the Python scripts in ./scripts/, Python 2.7+ and TensorFlow v1.0+ are needed.
o To run the jar package in ./jar/, JDK 1.7 is needed.
o To compile the Java code in ./java/, Galago from the Lemur Project is needed (https://sourceforge.net/p/lemur/wiki/Galago%20Installation/).

Data Preparation

o Download Amazon review datasets from http://jmcauley.ucsd.edu/data/amazon/. In our paper, we used 5-core data.
o Stem and remove stop words from the Amazon review datasets if needed. In our paper, we stemmed the "reviewText" and "summary" fields without removing stop words.
    java -Xmx4g -jar ./jar/AmazonReviewData_preprocess.jar <jsonConfigFile> <review_file> <output_review_file>
    where
    <jsonConfigFile>
        A JSON file that specifies the file path of the stop word list. An example can be found in the root directory. Enter "false" if you don't want to remove stop words.
    <review_file>
        The path of the original Amazon review data.
    <output_review_file>
        The output path for the processed Amazon review data.
o Index datasets
    python ./scripts/index_and_filter_review_file.py <review_file> <indexed_data_dir> <min_count>
    where
    <review_file>
        The file path of the Amazon review data.
    <indexed_data_dir>
        The output directory for indexed data.
    <min_count>
        The minimum count for terms. If a term appears fewer than <min_count> times in the data, it will be ignored.
o Split train/test
    -- Download the meta data from http://jmcauley.ucsd.edu/data/amazon/
    -- Split datasets for training and test
        python ./scripts/split_train_test.py <indexed_data_dir> <review_sample_rate>
        where
        <indexed_data_dir>
            The directory for indexed data.
        <review_sample_rate>
            The proportion of each user's reviews used for testing. In our paper, we used 0.3.
    -- Match image features
        + Download the image features from http://jmcauley.ucsd.edu/data/amazon/.
        + Match image features with product ids.
            python ./scripts/match_with_image_features.py <indexed_data_dir> <image_feature_file>
            where
            <indexed_data_dir>
                The directory for indexed data.
            <image_feature_file>
                The file containing the image feature data.
    -- Match rating features
        + Construct latent representations based on rating information with any method you like (e.g. BPR).
        + Format the latent factors of items and users in "item_factors.csv" and "user_factors.csv" such that each row is the latent vector of the corresponding item/user in <indexed_data_dir>/product.txt.gz and <indexed_data_dir>/user.txt.gz. See the example csv files.
        + Put item_factors.csv and user_factors.csv into <indexed_data_dir>.

Model Training/Testing

    python ./JRL/main.py --<parameter_name> <parameter_value> --<parameter_name> <parameter_value> …
    where parameter names and values include:
    learning_rate
        The learning rate in training. Default 0.05.
    learning_rate_decay_factor
        The learning rate decays by this factor whenever the loss is higher than the three previous losses. Default 0.90.
    max_gradient_norm
        Clip gradients to this norm. Default 5.0.
    subsampling_rate
        The rate used for subsampling. Default 1e-4.
    L2_lambda
        The lambda for L2 regularization. Default 0.0.
    image_weight
        The weight for the image-feature-based training loss. See the paper for more details.
    batch_size
        Batch size used in training. Default 64.
    data_dir
        Data directory, which should be the <indexed_data_dir>.
    input_train_dir
        The directory of training and testing data, which is usually <data_dir>/query_split/
    train_dir
        Model directory and output directory.
    similarity_func
        The function used to compute the ranking score for an item with the joint model of query and user embeddings. Default "product". Available functions include:
            "product"         The dot product of two vectors.
            "cosine"          The cosine similarity of two vectors.
            "bias_product"    The dot product plus an item-specific bias.
    net_struct
        Network structure parameters. Different parameters are separated by "_". Default "simplified_fs". Network structure parameters include:
            "bpr"           Train models in a BPR framework [1].
            "simplified"    Use simplified embedding-based language models that do not model each review separately [2].
            "hdc"           Use regularized embedding-based language models with word context [4]. If neither "simplified" nor "hdc" is given, the default model is used, which is an embedding-based language model based on the paragraph vector model [3].
            "extend"        Use the extendable model structure. See more details in the paper.
            "text"          Use review data.
            "image"         Use image data.
            "rate"          Use rating-based latent representations.
        Note: If none of "text", "image" and "rate" is specified, the model will use all of them.
    embed_size
        Size of each embedding. Default 100.
    window_size
        Size of the context window for the hdc model. Default 5.
    max_train_epoch
        Limit on the number of training epochs (0 means no limit). Default 5.
    steps_per_checkpoint
        How many training steps to do per checkpoint. Default 200.
    seconds_per_checkpoint
        How many seconds to wait before storing embeddings. Default 3600.
    negative_sample
        How many samples to generate for negative sampling. Default 5.
    decode
        Set to "False" for training and "True" for testing. Default "False".
    test_mode
        Test mode. Default "product_scores". Test modes include the following:
            "product_scores"      Output ranking results and ranking scores.
            "output_embedding"    Output embedding representations for users, items and words.
    rank_cutoff
        Rank cutoff for output rank lists. Default 100.

Evaluation

o After training with "--decode False", generate test rank lists with "--decode True".
o TREC format rank lists for test data will be stored in <train_dir> with the name "test.<similarity_func>.ranklist".
o Evaluate test rank lists with the ground truth in <input_train_dir>/test.qrels.
    python recommendation_metric.py <rank_list_file> <test_qrel_file> <rank_cutoff_list>
    where
    <rank_list_file>
        The result list, e.g. <train_dir>/test.<similarity_func>.ranklist
    <test_qrel_file>
        The ground truth, e.g. <input_train_dir>/test.qrels
    <rank_cutoff_list>
        The number of top documents to use in evaluation, e.g. NDCG@10 -> rank_cutoff_list=10.

References

[1] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. "BPR: Bayesian personalized ranking from implicit feedback". In UAI.
[2] Yongfeng Zhang, Qingyao Ai, Xu Chen, W. Bruce Croft. 2017. "Joint Representation Learning for Top-N Recommendation with Heterogeneous Information Sources". In Proceedings of CIKM '17.
[3] Quoc V Le and Tomas Mikolov. 2014. "Distributed Representations of Sentences and Documents". In ICML.
[4] Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2015. "Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations". In ACL.
[5] Ivan Vulić and Marie-Francine Moens. 2015. "Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings". In Proceedings of the 38th ACM SIGIR.
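
Appendix: evaluation sketch

The Evaluation step compares a TREC format rank list with a qrels file at a given rank cutoff. The following is a minimal Python 3 sketch of how NDCG at a cutoff could be computed from such files. It is not the code of recommendation_metric.py. It assumes the standard TREC layouts ("<user_id> Q0 <product_id> <rank> <score> <run_name>" for rank lists and "<user_id> 0 <product_id> <relevance>" for qrels), and the function names read_qrels, read_ranklist, ndcg_at_k and mean_ndcg are hypothetical.

```python
import math
from collections import defaultdict


def read_qrels(lines):
    """Parse qrels lines (assumed layout: user_id, iteration, product_id, relevance)."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        user_id, _, product_id, rel = parts
        qrels[user_id][product_id] = int(rel)
    return dict(qrels)


def read_ranklist(lines):
    """Parse rank list lines (assumed layout: user_id, Q0, product_id, rank, score, run)."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        user_id, _, product_id, _, score, _ = parts
        scored[user_id].append((float(score), product_id))
    # Order each user's products by descending score.
    return {
        user_id: [pid for _, pid in sorted(items, key=lambda x: -x[0])]
        for user_id, items in scored.items()
    }


def ndcg_at_k(ranked, rels, k):
    """NDCG@k for one user: DCG of the ranked list divided by the ideal DCG."""
    dcg = sum(rels.get(pid, 0) / math.log2(i + 2) for i, pid in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def mean_ndcg(runs, qrels, k):
    """Average NDCG@k over all users that appear in the qrels."""
    scores = [ndcg_at_k(runs.get(user_id, []), rels, k) for user_id, rels in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Tiny in-memory example with made-up ids and scores.
example_qrels = read_qrels(["u1 0 p1 1", "u1 0 p2 1"])
example_runs = read_ranklist([
    "u1 Q0 p3 1 0.9 jrl",
    "u1 Q0 p1 2 0.8 jrl",
    "u1 Q0 p2 3 0.7 jrl",
])
print(mean_ndcg(example_runs, example_qrels, 3))
```

To apply the sketch to real output, pass open file handles instead of the in-memory lists, e.g. the files named <train_dir>/test.<similarity_func>.ranklist and <input_train_dir>/test.qrels. Users with no ranked products contribute a score of 0 in this sketch, which may differ from how the provided script treats them.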