houzhijian/GroundNLQ

RuntimeError: Given groups=1, weight of size [384, 2304, 3], expected input[2, 256, 2560] to have 2304 channels, but got 256 channels instead

TousakaNagio opened this issue · 1 comments

Hi,

I followed the instructions to download the video features and convert them to LMDB.
However, when I ran the pretrain script, this RuntimeError occurred:

RuntimeError: Given groups=1, weight of size [384, 2304, 3], expected input[2, 256, 2560] to have 2304 channels, but got 256 channels instead

Could you please help me resolve this problem?
Thank you very much.

Hi, the visual feature the network actually takes as input is the concatenation of the EgoVLP (dimension: 256), InternVideo-Verb (dimension: 1024), and InternVideo-Noun (dimension: 1024) features, for an overall dimension of 2304. In your case, it looks like you are using only the EgoVLP features, which is why the input has 256 channels.
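
For reference, the error message describes a 1D convolution whose weight has shape [384, 2304, 3] (out_channels, in_channels, kernel_size), while the input has shape [2, 256, 2560] (batch, channels, length), so the channel count is 256 rather than 2304. Below is a minimal sketch of the concatenation, not the repository's actual preprocessing code. It assumes each feature set is stored per video as a time-aligned (T, D) array; the helper name `concat_video_features` is hypothetical.

```python
import numpy as np

# Feature dimensions as stated in the reply above.
EGOVLP_DIM = 256
INTERNVIDEO_VERB_DIM = 1024
INTERNVIDEO_NOUN_DIM = 1024
EXPECTED_DIM = EGOVLP_DIM + INTERNVIDEO_VERB_DIM + INTERNVIDEO_NOUN_DIM  # 2304


def concat_video_features(egovlp, verb, noun):
    """Hypothetical helper: join three (T, D) arrays along the feature axis.

    Assumes all three arrays cover the same T time steps of one video.
    """
    feats = [np.asarray(egovlp), np.asarray(verb), np.asarray(noun)]
    for f in feats:
        if f.ndim != 2:
            raise ValueError(f"expected a (T, D) array, got shape {f.shape}")

    # All streams must have the same temporal length before concatenation.
    lengths = {f.shape[0] for f in feats}
    if len(lengths) != 1:
        raise ValueError(f"temporal lengths differ: {sorted(lengths)}")

    fused = np.concatenate(feats, axis=-1)

    # Catch the mismatch early instead of inside the first convolution.
    if fused.shape[-1] != EXPECTED_DIM:
        raise ValueError(
            f"expected {EXPECTED_DIM} channels, got {fused.shape[-1]}"
        )
    return fused


if __name__ == "__main__":
    T = 8  # arbitrary number of time steps for the demo
    fused = concat_video_features(
        np.zeros((T, EGOVLP_DIM), dtype=np.float32),
        np.zeros((T, INTERNVIDEO_VERB_DIM), dtype=np.float32),
        np.zeros((T, INTERNVIDEO_NOUN_DIM), dtype=np.float32),
    )
    print(fused.shape)  # (8, 2304)
```

Checking the fused dimension when the features are written makes a missing stream show up during conversion rather than as a shape error at training time.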