Due to a shortage of GPU machines, this work was delayed by several months, so only a limited number of ablation experiments are included in the paper. We plan to add more experiments as resources become available.
Create virtual environment from configuration file
conda env create --file configs/mmsrec.yml
Activate virtual environment
conda activate mmsrec
Download the CLIP weights (CLIP is used as the feature extractor)
sh weights/clip/download.sh
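For reference, a minimal sketch of loading the downloaded weights as a frozen feature extractor with the openai CLIP package; the checkpoint name ("ViT-B/32") and the weights directory passed to download_root are assumptions, and the repository's own loading code may differ.
import torch
import clip  # openai CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ViT-B/32" and the weights directory are assumptions, not confirmed by this repo
model, preprocess = clip.load("ViT-B/32", device=device, download_root="weights/clip")
model.eval()  # used as a frozen feature extractor; no gradient updates to CLIP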
Go to the dataset/webvid/preprocess directory
Install git-lfs
sudo apt-get install git-lfs
Download dataset
git lfs clone https://huggingface.co/datasets/iejMac/CLIP-WebVid.git
Extract dataset
sh download.sh
Generate training files
python process_item.py
python process_seq.py
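Purely as an illustration of what a sequence-generation step typically looks like in sequential recommendation (the actual logic lives in process_seq.py and may differ), the sketch below groups a hypothetical interaction table into per-user, time-ordered item sequences; the file and column names are invented.
import pandas as pd

# Hypothetical input; process_seq.py defines the actual format
df = pd.read_csv("interactions.csv")  # assumed columns: user_id, item_id, timestamp
df = df.sort_values(["user_id", "timestamp"])
# One time-ordered item sequence per user
sequences = df.groupby("user_id")["item_id"].apply(list)
sequences.to_json("train_sequences.json")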
Go to the dataset/msrvtt/preprocess directory
Download dataset
sh download.sh
Generate training files
python process_item.py
python process_seq.py
Go to the dataset/amazon/preprocess directory
Download the datasets (this downloads the Beauty, Sports, Clothing, and Home categories; adjust the script to download only the ones you need)
sh download.sh
Extract image links from the dataset metadata and generate the training files
python process_item.py
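For reference, the Amazon review metadata released by McAuley et al. stores one Python-style dict per line, with image URLs under the imUrl key; a minimal parsing sketch follows (the file name is an assumption, and process_item.py may handle this differently).
import ast
import gzip

# File name is an assumption; the metadata dumps hold one dict-like record per line
with gzip.open("meta_Beauty.json.gz", "rt") as f:
    for line in f:
        item = ast.literal_eval(line)  # records are Python literals, not strict JSON
        asin = item["asin"]
        image_url = item.get("imUrl")  # may be missing for some items
        if image_url:
            print(asin, image_url)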
Feature extraction
python extract_features.py
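A minimal sketch of CLIP image feature extraction, assuming the openai CLIP package and a local image file; extract_features.py may batch and store features differently.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # checkpoint name is an assumption

# "item.jpg" is a placeholder for a downloaded product image
image = preprocess(Image.open("item.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)  # shape (1, 512) for ViT-B/32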
Go to the dataset/movielens-1m/preprocess directory
Download dataset
sh download.sh
Scrape the video data
python download_videos.py
Generate training files
python process_item.py
Feature extraction
python extract_features.py
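For video items, one common scheme (an assumption on our part; extract_features.py may differ) is to sample a few frames evenly and mean-pool their CLIP embeddings, as sketched below with a placeholder file name.
import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # checkpoint name is an assumption

cap = cv2.VideoCapture("trailer.mp4")  # placeholder file name
frames = []
ok, frame = cap.read()
while ok:
    # OpenCV returns BGR; CLIP's preprocess expects an RGB PIL image
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    ok, frame = cap.read()
cap.release()

step = max(len(frames) // 8, 1)  # keep roughly 8 evenly spaced frames
batch = torch.stack([preprocess(f) for f in frames[::step]]).to(device)
with torch.no_grad():
    video_feature = model.encode_image(batch).mean(dim=0)  # mean-pooled video-level feature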
Execute the following command to start pretraining
sh pretrain_webvid.sh
Note: the script assumes 8 GPUs per node by default. To use a different number of GPUs, modify the following parameter
--nproc_per_node=8 \
The configuration file for pretraining is configs/pretraining/pretrain_webvid.yaml
Finetune the pre-trained model on Amazon
sh finetune_amazon.sh
The configuration file is configs/pretraining/finetune_amazon.yaml
Finetune the pre-trained model on Movielens-1M
sh finetune_movielens.sh
The configuration file is configs/pretraining/finetune_movielens.yaml