
Source code for the ICASSP 2024 paper "M3SUM: A NOVEL UNSUPERVISED LANGUAGE-GUIDED VIDEO SUMMARIZATION".


M3Sum

[Title] M3Sum: A Novel Unsupervised Language-guided Video Summarization

Preparation

  1. Clone the repo to your local machine.
  2. Install Python 3.7.12.
  3. Open a shell or cmd in the repo folder and run the following command to install the necessary packages.
pip install -r requirements.txt
  4. Set the api_key value in the code to your own OpenAI API key (see the sketch after this list).
  5. Download the videos from this link (extraction code: 1234) and put the "videos" folder into the corresponding dataset folder.
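
For reference, setting the key with the pre-1.0 openai Python package (the version compatible with Python 3.7) usually looks like the sketch below; the exact variable name and location inside this repo's scripts may differ, so treat it as an illustration only.

```python
# Illustration only: setting an OpenAI key with the pre-1.0 openai package.
# The repo's scripts may store the key under a different variable or file.
import openai

openai.api_key = "sk-..."  # replace with your own OpenAI API key
```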

Experiments

  1. Inference: Run the following commands to perform inference with the model. Some hyper-parameters accept several values; their meanings are listed in the table below. (A sketch of how frames are batched into requests follows the example commands.)

Scripts: predict_with_chatgpt.py (standard prompting) or predict_with_chatgpt_by_CoT.py (progressive CoT)

| Parameter  | Type   | Description |
| ---------- | ------ | ----------- |
| data       | string | Data used for video summarization; choose from "caption", "transcript", or "transcript2caption" |
| batch_size | int    | Number of frames scored in one inference call |
cd ./TVSum
python predict_with_chatgpt.py \
    --data caption \
    --batch_size 120

python predict_with_chatgpt.py \
    --data transcript \
    --batch_size 30

python predict_with_chatgpt.py \
    --data transcript2caption \
    --batch_size 20
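
The batch_size parameter controls how many frame descriptions are packed into a single ChatGPT request. The sketch below illustrates that batching step under the pre-1.0 openai API; the prompt wording, model choice, and response parsing are assumptions for illustration and will differ from the repo's actual predict_with_chatgpt.py.

```python
# Minimal sketch of batching frame captions into ChatGPT requests.
# Illustration only: prompt, model, and parsing are assumptions.
import openai

def score_frames(captions, batch_size=120, model="gpt-3.5-turbo"):
    """Ask ChatGPT to rate the importance of each frame caption from 0 to 10."""
    scores = []
    for start in range(0, len(captions), batch_size):
        batch = captions[start:start + batch_size]
        prompt = "Rate the importance of each frame description from 0 to 10, one score per line:\n"
        prompt += "\n".join(f"{i + 1}. {c}" for i, c in enumerate(batch))
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        reply = response["choices"][0]["message"]["content"]
        # Naive parsing: take the last integer on each line as that frame's score.
        for line in reply.splitlines():
            tokens = line.replace(".", " ").replace(":", " ").split()
            numbers = [t for t in tokens if t.isdigit()]
            if numbers:
                scores.append(float(numbers[-1]))
    return scores
```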
  2. Evaluation: After predicting the scores of the frames, we calculate the F1 scores of the prediction results. Run the following commands to evaluate the model. Some hyper-parameters accept several values; their meanings are listed in the table below. (A sketch of the F1 computation follows the commands.)

evaluate_video_summarization.py

| Parameter  | Type   | Description |
| ---------- | ------ | ----------- |
| data       | string | Data used for video summarization; choose from "caption", "transcript", "transcript2caption", or "merge" |
| pcot       | int    | Whether to evaluate the prediction results of progressive CoT (1 = yes) |
| merge_mode | string | Alignment metric used for merging; choose from "ppl", "bertscore", or "bleu" |
| threshold  | float  | Threshold value for the alignment |

P.S. When the "data" parameter is set to "merge", the frame scores of "caption" and "transcript2caption" are merged using the chosen alignment metric. This mode corresponds to the alignment module in our paper (an illustrative sketch of such a merge follows).
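
The merging rule itself lives in evaluate_video_summarization.py. As an illustration only, a threshold-based merge using BERTScore could look like the sketch below; the merge_frame_scores helper is hypothetical and the repo's actual rule may differ.

```python
# Illustration of a threshold-based score merge with BERTScore; the rule
# actually implemented in evaluate_video_summarization.py may differ.
from bert_score import score as bert_score  # pip install bert-score

def merge_frame_scores(caption_texts, t2c_texts,
                       caption_scores, t2c_scores, threshold=0.6):
    """Merge per-frame scores when a frame's caption and transcript2caption
    texts are semantically aligned (BERTScore F1 >= threshold)."""
    _, _, f1 = bert_score(caption_texts, t2c_texts, lang="en", verbose=False)
    merged = []
    for sim, c, t in zip(f1.tolist(), caption_scores, t2c_scores):
        # Average the two predictions when the texts agree; otherwise keep
        # the caption-based prediction.
        merged.append((c + t) / 2 if sim >= threshold else c)
    return merged
```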

cd ./TVSum

# evaluate the results of standard prompting
python evaluate_video_summarization.py \
    --data caption

python evaluate_video_summarization.py \
    --data transcript

python evaluate_video_summarization.py \
    --data transcript2caption

# evaluate the results of progressive CoT
python evaluate_video_summarization.py \
    --data caption \
    --pcot 1

python evaluate_video_summarization.py \
    --data transcript \
    --pcot 1

python evaluate_video_summarization.py \
    --data transcript2caption \
    --pcot 1

# evaluate the results of standard prompting with alignment
python evaluate_video_summarization.py \
    --data merge \
    --merge_mode bertscore \
    --threshold 0.6

# evaluate the results of progressive CoT with alignment
python evaluate_video_summarization.py \
    --data merge \
    --pcot 1 \
    --merge_mode bertscore \
    --threshold 0.6
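
For reference, TVSum-style evaluation usually reports an F1 score between the predicted and ground-truth frame selections. The sketch below shows a simplified frame-level version of that computation; the keyshot_f1 helper is hypothetical, and evaluate_video_summarization.py may apply additional steps (e.g. per-annotator aggregation).

```python
# Simplified F1 between binary per-frame summary selections
# (illustration only; the repo's evaluation script may differ in detail).
import numpy as np

def keyshot_f1(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    overlap = float(np.logical_and(pred_mask, gt_mask).sum())
    if pred_mask.sum() == 0 or gt_mask.sum() == 0:
        return 0.0
    precision = overlap / pred_mask.sum()
    recall = overlap / gt_mask.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```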