(Feature request) Batched feature extraction
christian-matroid opened this issue · 18 comments
Hello, thank you for releasing the code and great work!
Is there a way to increase the batch size in the simple feature extraction examples? The current script only uses about 7 GB of VRAM during feature extraction.
Hello! Due to the varying length of each video, parallel processing may be troublesome and require large memory for temporary features. We may not support this feature for the time being.
Hello! Thank you so much for your response! I have two follow-up questions if it won't take too much time.
may be troublesome and require large memory for temporary features
If I have access to a large amount of memory, could it be as simple as just increasing the batch and reshaping these in the input tensor batch dimension?
# ...
# add some "batch_size" integer argument
for start_idx in tqdm.tqdm(start_idx_range(len(vr)), position=1, leave=0, desc=vid_name):
    data = vr.get_batch(np.arange(start_idx, start_idx + 16 * args.batch_size)).asnumpy()
    tensor_data = torch.from_numpy(data).cuda()  # Size([16*bs, 566, 320, 3])
    tdq = transform(tensor_data).unsqueeze(0)  # Size([3, 16*bs, 224, 224])
    tdq = torch.reshape(tdq, (args.batch_size, tdq.shape[1], 16, *tdq.shape[3:]))
    with torch.no_grad():
        batched_feature = model.forward_features(tdq)
    feature_list.extend(feature.cpu().numpy() for feature in batched_feature)
# ...
Additionally, when I try to load the pretrained ViT-L backbone architecture, I get numerous parameter mismatches. Is there an additional parameter I need to change to use the VideoMAE (v1) model zoo models?
# ask to initialize pretrained backbone from VideoMAE model zoo
print(args.model)  # vit_large_patch16_224
model = create_model(
    args.model,
    img_size=224,
    pretrained=False,
    # num_classes=710,
    all_frames=16,
    tubelet_size=2,
    drop_path_rate=0.3,
    use_mean_pooling=True,
)
ckpt = torch.load(args.ckpt_path, map_location="cpu")
for model_key in ["model", "module"]:
    if model_key in ckpt:
        ckpt = ckpt[model_key]
        break
model.load_state_dict(ckpt)  # ERRORS HERE: output is a very long torch model parameter mismatch string
There is something wrong with the code you wrote for extracting features with a larger batch size.
Specifically, after the transform, the feature shape is [3, bs * 16, 224, 224]. This should be followed by tdq = rearrange(tdq, 'c (b t) h w -> b c t h w', b=bs, t=16). Also, your code does not consider the case where the video length is not divisible by the batch size * 16. It is recommended that if you are not sure, you still use the original code to extract the features.
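The two points above (the axis order after the transform, and videos whose clip count is not a multiple of the batch size) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the repository's code: `clip_starts`, `batched`, and `to_model_layout` are hypothetical helpers, the stride of 8 and the toy 2x2 frames are chosen only for illustration, and the plain transpose is a stand-in for the real transform. The reshape plus transpose is intended to mirror the rearrange pattern 'c (b t) h w -> b c t h w'.

```python
import numpy as np

CLIP_LEN = 16  # frames per clip, as in the thread


def clip_starts(num_frames, stride):
    """Start index of every full CLIP_LEN-frame window (stride is an assumption)."""
    return list(range(0, num_frames - CLIP_LEN + 1, stride))


def batched(starts, batch_size):
    """Yield groups of at most batch_size start indices; the last group may be short."""
    for i in range(0, len(starts), batch_size):
        yield starts[i:i + batch_size]


def to_model_layout(frames, num_clips):
    """(C, B*T, H, W) -> (B, C, T, H, W), like rearrange 'c (b t) h w -> b c t h w'."""
    c, bt, h, w = frames.shape
    assert bt == num_clips * CLIP_LEN
    return frames.reshape(c, num_clips, CLIP_LEN, h, w).transpose(1, 0, 2, 3, 4)


# Toy "video": 40 frames of 2x2 RGB, each frame filled with its own frame index.
num_frames = 40
frame_ids = np.arange(num_frames, dtype=np.float32)[:, None, None, None]
video = frame_ids * np.ones((1, 2, 2, 3), dtype=np.float32)

shapes = []
for group in batched(clip_starts(num_frames, stride=8), batch_size=3):
    idx = np.concatenate([np.arange(s, s + CLIP_LEN) for s in group])
    frames = video[idx]                    # (B*T, H, W, C)
    frames = frames.transpose(3, 0, 1, 2)  # stand-in for the transform: (C, B*T, H, W)
    batch = to_model_layout(frames, len(group))
    shapes.append(batch.shape)

print(shapes)
```

With 4 clips and a batch size of 3, the last group holds a single clip, so the final batch is simply smaller rather than padded. The same reshape and permute should carry over to torch tensors.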
For the second problem, you should remove the prefix "encoder." from the model key, as in line 587 of VideoMAEv2/run_class_finetuning.py (lines 582 to 591 in 9492db0).
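As a rough illustration of the prefix removal described above, the sketch below renames checkpoint keys before they are passed to `load_state_dict`. It is not the code from `run_class_finetuning.py`: `strip_prefix` is a hypothetical helper, the key names are made up for the example, and keys without the prefix are kept unchanged here, which may or may not be what a given checkpoint needs.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="encoder."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            cleaned[key[len(prefix):]] = value
        else:
            # Assumption: keys without the prefix are passed through untouched.
            cleaned[key] = value
    return cleaned


# Illustrative keys only; real checkpoints hold tensors, not integers.
ckpt = OrderedDict([
    ("encoder.patch_embed.proj.weight", 1),
    ("encoder.blocks.0.norm1.weight", 2),
    ("mask_token", 3),
])
cleaned = strip_prefix(ckpt)
print(list(cleaned))
```

In the loading snippet from the question, the result would be used in place of `ckpt` in the final `model.load_state_dict(...)` call.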
In addition, using a pre-trained model without the fine-tuning supervision of high semantic hard labels for TAD tasks can be very ineffective. So please do not use this model for feature extraction.
Thank you for the quick response.
It is recommended that if you are not sure, you still use the original code to extract the features.
Thank you. I will keep this in mind.
In addition, using a pre-trained model without the fine-tuning supervision of high semantic hard labels for TAD tasks can be very ineffective.
I see. I was attempting to follow the method mentioned in the Downstream: Temporal Action Localization readme in the InternVideo repository and the InternVideo paper (section 4.3.1). The hosted features shared there worked very well with the ActionFormer head, and I am trying to replicate their performance by extracting features on my own custom data. If additional fine-tuning was used, can you explain what that might have been?
InternVideo also uses fine-tuned models. In fact, VideoMAE v2 and InternVideo's TAD task were done by the same guy.
InternVideo also uses fine-tuned models
So if I want to extract features on a custom dataset with VideoMAE, I should be fine-tuning a backbone model on that custom dataset, then performing feature extraction?
I'm not sure, but the model finetuned on K710 should perform best (you need not perform extra supervision on your custom dataset)
Hello @congee524. Thanks so much for your help so far! I've reiterated my remaining questions a little more concisely on this InternVideo issue as it is more relevant and visible. If you know more about the fine-tuning/TAL feature extraction process I would be incredibly grateful if you responded. Thanks again!
hybrid pretrain -> k710 finetune -> extract tad features -> actionformer finish task
As far as I know this is the case
@congee524 Thank you for replying. I used the ViT-Giant model finetuned on K710 to perform feature extraction on THUMOS, and the exact configs (with only the input dimension changed) hosted on the InternVideo GitHub to benchmark the features with ActionFormer.
I was not able to reproduce the same results as the pre-extracted features (I got 44.77% average mAP versus the reported 71.58%). Perhaps the model used for feature extraction is finetuned on THUMOS directly?
The information is too limited for me to tell what went wrong. We have put out our own extracted features in TAD.md, perhaps you can use it to check if there is a problem with the features you extracted.
@congee524 Thank you for pointing me to these hosted TAD features. I was able to reproduce your results with these features as well. 🎉
To check whether my feature extraction is performing as intended, I performed extraction with extract_tad_feature.py and a fresh installation of VideoMAEv2 using the vit_g_hybrid_pt_1200e_k710_ft.pth weights downloaded from the model weight links document that was shared with me.
python extract_tad_feature.py \
    --data_set THUMOS14 \
    --data_path raw_data/thumos14_videos/test_selection \
    --save_path sample_data/test_selection \
    --model vit_giant_patch14_224 \
    --ckpt_path models/vit_g_hybrid_pt_1200e_k710_ft.pth
I compared my features with the hosted TAD features, and I get different features than the ones hosted. The video I used to compare is video_test_0000556.mp4, downloaded directly from the official THUMOS dataset.
shape vit_g_k710 extracted: (504, 1408)
shape vit_g_k710 hosted: (504, 1408)
Difference between features for video_validation_000556.npy:
total abs diff: 263538.15625
mean abs diff: 0.3713729977607727
std diff: 0.5049932599067688
Perhaps the model weights are different or your features were extracted with a different script?
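For anyone repeating this check, the comparison above can be sketched roughly as below. This is an assumed reconstruction, not the script that produced the numbers in this comment: `compare_features` is a hypothetical helper, "std diff" is taken to mean the standard deviation of the signed element-wise difference, and the per-clip cosine similarity is an extra metric added here because it is insensitive to a global scale difference between the two feature sets. The toy arrays stand in for the two `.npy` files.

```python
import numpy as np


def compare_features(extracted, hosted):
    """Summarise element-wise differences between two (num_clips, dim) arrays."""
    if extracted.shape != hosted.shape:
        raise ValueError(f"shape mismatch: {extracted.shape} vs {hosted.shape}")
    a = extracted.astype(np.float64)
    b = hosted.astype(np.float64)
    diff = a - b
    abs_diff = np.abs(diff)
    # Cosine similarity per clip; the small floor avoids division by zero.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cosine = np.sum(a * b, axis=1) / np.maximum(norms, 1e-12)
    return {
        "total_abs_diff": float(abs_diff.sum()),
        "mean_abs_diff": float(abs_diff.mean()),
        "std_diff": float(diff.std()),
        "mean_cosine": float(cosine.mean()),
    }


# Toy stand-ins for the arrays that np.load would return for the two feature files.
extracted = np.ones((4, 8), dtype=np.float32)
hosted = np.full((4, 8), 1.5, dtype=np.float32)
stats = compare_features(extracted, hosted)
print(stats)
```

A mean cosine similarity close to 1 alongside a large absolute difference would point to a scaling or normalisation mismatch, whereas a low cosine similarity would suggest different weights or different input frames.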
Thanks for the info! I'll recheck the extraction script in a few days.
Hi @congee524, thank you so much for your help so far. Have you had a chance to look at the feature extraction script?
Kindly bumping this again.
Kindly bumping this again.
Sorry for my late reply, I have been rather busy recently. I briefly checked the features earlier and didn't see a problem. Do the features you extracted yourself and the features we released have the same shape?
Hi @congee524, my features are of the same shape (both in frame number and dimension) but have different values. I've uploaded a few examples for direct comparison to this drive link, as well as the raw video data and the model weights of vit_g_hybrid_pt_1200e_k710_ft.pth I used for feature extraction. Let me know if you'd like me to remove the model weights.
@congee524 Hello, have you successfully run the code of VideoMAE V2? I want to finetune it with my own dataset but I have met some difficulties. I would appreciate it if you could give me some advice!