JacobChalk/TIM

how to recreate the result

WannaSir opened this issue · 4 comments

[screenshot: reported results]

The result above is obtained using the command under the directory /TIM/recognition? And which flag produces it, '--validate' or '--extract_feats'? I have no idea how to use the '--validate' or '--extract_feats' flags to recreate the result, because the README.md file only provides a simple command, as shown below:

[screenshot: README command]

Could you provide a more detailed command, so I can recreate the result using your pretrained models?

Hi,

Yes, those results are achieved under the recognition folder using the --validate flag, which reports the accuracy results shown in the paper.

The --extract_feats flag will simply extract the classification logits for each action and save them to a dictionary.
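As a minimal sketch of what you can do with such a dictionary once it is saved: the keys, file name, and storage format below are hypothetical (TIM may, for example, use torch.save rather than pickle), but the idea of mapping each action's logits to a predicted class is the same.

```python
import os
import pickle
import tempfile

# Hypothetical illustration of an extracted-logits dictionary; the exact keys
# and storage format the repo uses may differ.
logits_dict = {
    "action_0001": [0.1, 2.3, -0.5],  # one logit per class
    "action_0002": [1.7, 0.2, 0.9],
}

path = os.path.join(tempfile.mkdtemp(), "extracted_logits.pkl")
with open(path, "wb") as f:
    pickle.dump(logits_dict, f)

# Reload the dictionary and turn each action's logits into a predicted class
# via argmax over the logit vector.
with open(path, "rb") as f:
    loaded = pickle.load(f)

predictions = {k: max(range(len(v)), key=v.__getitem__) for k, v in loaded.items()}
print(predictions)  # {'action_0001': 1, 'action_0002': 0}
```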

As the README mentions, you take the detailed training command above that section in the same README, change the --train flag to --validate, and add the path to the pre-trained model with the --pretrained_model arg. The command would look like this:

```shell
python scripts/run_net.py \
--validate \
--output_dir /path/to/output \
--video_data_path /path/to/AVE_visual_features \
--video_train_action_pickle /path/to/AVE_train_annotations \
--video_val_action_pickle /path/to/AVE_validation_annotations \
--video_train_context_pickle /path/to/AVE_visual_feature_intervals \
--video_val_context_pickle /path/to/AVE_validation_visual_feature_intervals \
--visual_input_dim <channel-size-of-visual-features> \
--audio_data_path /path/to/AVE_audio_features \
--audio_train_action_pickle /path/to/AVE_train_annotations \
--audio_val_action_pickle /path/to/AVE_validation_annotations \
--audio_train_context_pickle /path/to/AVE_train_audio_feature_intervals \
--audio_val_context_pickle /path/to/AVE_audio_feature_intervals \
--audio_input_dim <channel-size-of-audio-features> \
--video_info_pickle /path/to/AVE_video_metadata \
--dataset ave \
--feat_stride 2 \
--feat_gap 0.2 \
--num_feats 25 \
--feat_dropout 0.1 \
--seq_dropout 0.1 \
--d_model 256 \
--apply_feature_pooling False \
--lr 5e-4 \
--lambda_audio 1.0 \
--lambda_drloc 0.1 \
--mixup_alpha 0.5 \
--include_verb_noun False \
--pretrained_model /path/to/pretrained_model
```

So the command is identical to the training command in the same README, but with two changes. Hope this helps!
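Those two changes can be sketched mechanically like this; train_cmd.sh here is a hypothetical file holding the README's training command on one line, not something the repo ships.

```shell
# Illustrative only: write a stand-in one-line training command to a file.
printf 'python scripts/run_net.py --train --output_dir /path/to/output\n' > train_cmd.sh

# Change 1: swap --train for --validate.
# Change 2: append the pre-trained checkpoint path.
sed 's/--train/--validate/' train_cmd.sh \
  | sed 's|$| --pretrained_model /path/to/pretrained_model|' > validate_cmd.sh

cat validate_cmd.sh
```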