This is the official implementation for Collaborative Word-based Pre-trained Item Representation for Transferable Recommendation.
python==3.9.7
recbole==1.0.1
torch==1.10.0
cudatoolkit==11.3
transformers==4.21.2
We provide a preprocessed dataset of Scientific
dataset in dataset/Scientific
. You can download the pretrained model from Google Drive and place it in saved/model
. Then You can repreduce our experiment results following the steps below.
Git clone this repository.
git clone https://github.com/ysh-1998/CoWPiRec.git
Get the text-based item embedding with CoWPiRec.
python get_emb.py --gpu_id=0 --dataset Scientific
Train and evaluate on downstream datasets Scientific
.
python downstream/finetune.py --gpu_id=0 -d Scientific
The pipeline to repreduce CoWPiRec is shown below.
To preprocessing datasets, you should prepare tow raw_data files:
- user interaction data
data_name.csv
, format:user_id,item_id,time
- item metadata
meta_data_name.csv
, format:item_id,item_text
Place these two files in ./dataset/raw_data/interaction
and ./dataset/raw_data/metadata
, respectively, then run script below:
python ./dataset/preprocessing/process_dataset.py --dataset dataset_name
python ./dataset/preprocessing/tokenizer.py --dataset dataset_name
The dataset path generated contains:
dataset/
dataset_name/
data_name.train.inter
data_name.dev.inter
data_name.test.inter
data_name.text
data_name.tokenize.json
data_name.item2index
data_name.user2index
The word graph is a key component in pre-training of CoWPiRec and is constructed based on co-click items. Obtain the co-click item pairs first.
python ./WordGraph/get_coclick_item_pairs.py --dataset dataset_name
Process co-click item pairs to get co-click word graph.
python ./WordGraph/get_coclick_word_graph.py \
--dataset dataset_name --num_workers 40 --max_len 64
You can perform multiprocessing by modifing the argument --number_workers
. The max length of item text is controlled by the argument --max_len
.
Process co-click word graph, filter neighbors based on tf-idf.
python ./WordGraph/get_word_graph.py --dataset dataset_name --topN 30
argument --topN
denotes the number of neighbors after filtering.
Pre-train on single GPU.
python pretrain.py --gpu_id=0 -d pretrain_dataset_name
Pre-train with multi GPUs.
CUDA_VISIBLE_DEVICES=0,1,2,3 python ddp_pretrain.py
Get item embedding using CoWPiRec.
python get_emb.py --gpu_id=0 --dataset dataset_name \
--load_pretrain_model --pretrain_model_path saved/xxx.pth
Downstream recommendation.
python downstream/finetune.py --gpu_id=0 -d dataset_name