DTaoo/Discriminative-Sounding-Objects-Localization

Tips about the evaluation on MUSIC-solo dataset

Closed this issue · 21 comments

There are some tips for evaluating the model on the MUSIC-solo dataset in Stage One:

  1. Train for 10~15 epochs.
  2. When evaluating localization performance, select the model saved after the localization step, e.g. a pth file named "location_cluster_net_0xx_xxxx_av_local.pth".
  3. For the training of stage two, use the model saved after the classification step of stage one as the pretrained model, e.g. a pth file named "location_cluster_net_0xx_xxxx_av_class.pth".

Hi, I have a problem with evaluating the model on the MUSIC-solo dataset for Stage One:
In test.py, lines 160~170:

# net setup
visual_backbone = resnet18(modal='vision', pretrained=True)
audio_backbone = resnet18(modal='audio')
av_model = Location_Net_stage_two(visual_net=visual_backbone, audio_net=audio_backbone)
if args.use_pretrain:
    PATH = os.path.join('ckpt/stage_two_cosine2/', args.ckpt_file)
    state = torch.load(PATH)
    av_model.load_state_dict(state)
    print(PATH)
av_model_cuda = av_model.cuda()

Does this mean I should train stage two when I want to test the performance of stage one?
In my opinion, on the MUSIC-solo dataset, training stage one is enough.


Hi, thanks for your question. You are right; please evaluate your model at stage one with the following:

python3 training_stage_one.py --mode test --use_pretrain 1 --ckpt_file your_ckpt_file_path
python3 tools.py

Thanks again. We have fixed this mistake.
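For clarity, here is a minimal sketch of what that evaluation amounts to, mirroring the test.py snippet quoted above but using the stage-one model class; the checkpoint filename is a placeholder and the released training_stage_one.py --mode test remains the authoritative version:

import torch

# Sketch only: resnet18(modal=...) and Location_Net_stage_one are the repository's
# own classes, as in the test.py snippet quoted above (their import paths are omitted).
visual_backbone = resnet18(modal='vision', pretrained=True)
audio_backbone = resnet18(modal='audio')
av_model = Location_Net_stage_one(visual_net=visual_backbone, audio_net=audio_backbone)
state = torch.load('location_cluster_net_0xx_xxxx_av_local.pth')  # stage-one checkpoint (placeholder name)
av_model.load_state_dict(state)
av_model_cuda = av_model.cuda()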

Thanks for your answer.
But my IoU and AUC are lower than yours.
I think there is something wrong with my dataset, which I collected from the Internet myself; I found that some pairs are missing. Could you share the MUSIC-solo dataset with me?
Another question: in training_stage_one.py, the random seed is not used, which makes each training or test result quite different.
Looking forward to your help. Thanks!


Sorry, we did not preserve the original videos. Perhaps it will be helpful to check the above tips.

It is a common phenomenon that experiment results are affected by the random seed. All our experiments are based on the same seed, so the comparison is fair.

Fine. For the same location_cluster_net_0xx_xxxx_av_local.pth (I test at epoch 9), the results of each test run of 'training_stage_one.py --mode test' and 'tools.py' are quite different. The IoU ranges between 47 and 50 and the AUC is about 40.5. I think the gap is too large. I have also tested different epochs from 7 to 14, and the results are worse. What is the problem?
Does it mean that I should set the same seed for testing? How do I do that? And while training, should the random seed be the same? Should the seeds for training and testing be the same?
I have tried adding 'random.seed(args.seed)' in training_stage_one.py in front of the line 'train_dataset = MUSIC_Dataset(args.data_dir, train_list_file, args)', but it does not work. The training results and test results are still different.
I have been confused for the whole day.


I checked the code and found that the released version is missing the part that sets the seed. You can use this:

def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Insert 'setup_seed(args.seed)' before line 331 of training_stage_one.py.

In the test stage, there is no need to randomly initialize the network parameters, so the seed actually has no effect.
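For completeness, here is a fuller seeding helper (my own sketch, not the released code) that also seeds Python's random module and NumPy, which matters if frames are sampled with random.choice, as mentioned in the next comment:

import random

import numpy as np
import torch

def setup_seed(seed):
    # Seed every generator the pipeline might touch; the random/NumPy seeding
    # beyond the maintainer's snippet above is my addition.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False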

Thanks very much for your answer.
I think the seed will affect the test result, since each time the model samples some frames for testing with random.choice(). So I also set random.seed(seed) and np.random.seed(seed).
But I still cannot get a good result. I followed the above tips, training for 15 epochs and testing the 'location_cluster_net_0xx_xxxx_av_local.pth' checkpoints.
I trained the model with seeds 0, 5, 10 and chose the best one (seed=5).
I have tested epochs 007 to 014. The results are as follows:
007: AUC: 37.6 IoU: 41
008: AUC: 40.7 IoU: 50.6
009: AUC: 40.2 IoU: 45
010: AUC: 40.3 IoU: 47.1
011: AUC: 36.7 IoU: 26.4
012: AUC: 37.5 IoU: 34.6
013: AUC: 37.8 IoU: 32
014: AUC: 38.2 IoU: 42.8
As you can see, the results are lower than yours and the IoU drops a lot after epoch 8. The results are TOO UNSTABLE. The results for seed=0 and seed=10 are much lower.
What do you think the problem is?
I run the experiment on one 2080 Ti GPU and the other configs are not changed.
pytorch=1.1.0, torchvision=0.2.1, scikit-learn=0.24.2, librosa=0.8.1, Pillow=8.2.0, opencv=3.4.2


The objective of the training is "match or not"/"classification", which is not directly related to localization. So a lower training loss does NOT mean better localization results.

Yep, I get it. What I want to do is use your model as the baseline model for the music-solo task. Do you mean that the unstable model is normal even in your previous experiments, so the model may not be suitable for single-object localization and is instead a component of the multi-object localization task?

Yeah, that is a good question. In past experiments, we also observed this unstable phenomenon, which may be a problem worth exploring.

OK, thank you sooo much!

Hi, I have found a problem in the MUSIC dataset.
In your solotest.json, all the video frames are sampled as 0000.jpg, 0007.jpg, 0014.jpg and so on, which means all the videos are assumed to be 30 fps.
However, some videos are 25 fps, and these videos will be sampled every 6 frames, like 0000.jpg, 0006.jpg, ..., when running cut_videos.py. This means many annotations in solotest.json will not be used.
Does it mean I should set the fps to 30 instead of fps = vid.get(cv2.CAP_PROP_FPS)?
BUT in that case, the audio segment will not align with the video segment, since the length of the audio segment is 1 second.
I have read another issue that asks the same question, but there is no answer.


Hello, I have some questions about training in stage one~
Q1. The "cluster" parameter is not accepted by the Location_Net_stage_one class.

      In training_stage_one.py, lines 330-333:
      # net setup
      visual_backbone = resnet18(modal='vision', pretrained=True)
      audio_backbone = resnet18(modal='audio')
      av_model = Location_Net_stage_one(visual_net=visual_backbone, audio_net=audio_backbone, cluster=args.cluster)

      In music-exp/model/location_modal.py, lines 4-7:
      class Location_Net_stage_one(nn.Module):
          def __init__(self, visual_net, audio_net):
              super(Location_Net_stage_one, self).__init__()

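One possible reconciliation (my own guess; the thread does not confirm which side is outdated) is to drop the extra keyword at the call site so that the call matches the released __init__ signature:

# Hypothetical workaround for Q1: call the constructor with only the arguments
# the released __init__ shown above accepts, i.e. drop the `cluster` keyword.
# resnet18(modal=...) and Location_Net_stage_one are the repository's own classes.
visual_backbone = resnet18(modal='vision', pretrained=True)
audio_backbone = resnet18(modal='audio')
av_model = Location_Net_stage_one(visual_net=visual_backbone, audio_net=audio_backbone)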
Q2. I process video and audio through cut_audio.py and cut_videos.py.
When I train with training_stage_one.py, audio_data has 3 dimensions and img_data has 4 dimensions,
but the fourth dimension of the audio tensor (audio_data.shape[3]) is required during training.
Do you know what I can do to solve this problem?

    audio_data.shape torch.Size([1, 1, 16000])
    posi_img_data.shape torch.Size([1, 3, 224, 224])

In training_stage_one.py, lines 30-40:
batch_audio_data = torch.zeros(audio_data.shape[0] * 2, audio_data.shape[1], audio_data.shape[2],
                               audio_data.shape[3])
batch_image_data = torch.zeros(posi_img_data.shape[0] * 2, posi_img_data.shape[1], posi_img_data.shape[2],
                               posi_img_data.shape[3])
batch_labels = torch.zeros(audio_data.shape[0] * 2)

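As an aside, a tiny sanity check (my own sketch; the 201x64 figure comes from the maintainer's reply further down in this thread, and the exact axis order is my assumption) of what the batching code above expects:

import torch

# The code above indexes audio_data.shape[3], so it expects a 4-D batch such as
# (B, 1, 201, 64) -- one log-mel spectrogram per 1 s clip -- not a raw
# (B, 1, 16000) waveform. Convert the waveform with cut_audio.py first.
audio_data = torch.zeros(1, 1, 201, 64)        # expected spectrogram input (assumed layout)
posi_img_data = torch.zeros(1, 3, 224, 224)    # image input, as printed above
assert audio_data.dim() == 4, "convert the 1 s waveform to a log-mel spectrogram first"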
Excuse me~
I'm also currently doing some research on multi-modal learning.
If it's convenient for you, could you add me on WeChat?
Nice to meet you~
My WeChat: 13011105988


Did you set the fps to 30 for all videos?
I have the same problem now.


Excuse me~
Did you process the audio through cut_audio.py?
I find that in cut_audio.py many lines are commented out,
so I did not get the 201x64 log-mel spectrogram.

Hi, about the fps issue: after preprocessing, for each 1 s clip we get a 201x64 log-mel and several images.

For each 1 s clip, during training we randomly select one image as the visual input;
during testing, we select the first image as the visual input.

Hence, the exact fps is not vital.
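A minimal sketch of that selection scheme (my own illustration; the function name and directory layout are assumptions, not the repository's code):

import os
import random

def pick_frame(clip_dir, training=True):
    """Pick the visual input for one 1 s clip: a random frame during training,
    the first frame during testing, as described above. Assumes the clip's
    frames are stored as .jpg files in clip_dir."""
    frames = sorted(f for f in os.listdir(clip_dir) if f.endswith('.jpg'))
    chosen = random.choice(frames) if training else frames[0]
    return os.path.join(clip_dir, chosen)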


Please uncomment lines 30~41 of cut_audio.py and save "log_mel_T" at line 43.
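For reference, a hedged sketch of what that preprocessing produces: a 201x64 log-mel per 1 s, 16 kHz clip, pickled to disk. The hop length and FFT size below are my assumptions, chosen so that 1 + 16000 // hop_length == 201 frames; the uncommented lines in cut_audio.py remain the authoritative version:

import pickle

import librosa
import numpy as np

def wav_to_log_mel(wav_path, out_path, sr=16000, n_fft=400, hop_length=80, n_mels=64):
    # Load the 1 s clip at 16 kHz -> 16000 samples.
    samples, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)            # shape (64, 201)
    log_mel_T = log_mel.T.astype(np.float32)      # shape (201, 64), the "log_mel_T" above
    with open(out_path, 'wb') as f:
        pickle.dump(log_mel_T, f)                 # write out as a .pkl file
    return log_mel_T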


Thanks for your response~
I am also studying audio-visual multimodal algorithms.
I look forward to reproducing this paper.
Can I add you on WeChat~~
My phone: 13011105988


Should I save "log_mel_T" as .pkl or .jpg?


pkl.

It is better to read the paper/code/instructions carefully, and if you still have any further problems, please send me an email.


Thank you very much~