linjieli222/HERO

got error while running pretrain.py

liveseongho opened this issue · 5 comments

Hi,

I'm trying to reproduce pretraining with the config pretrain-tv-ht-16gpu.json.

I got the following error:

[1,4]<stderr>:Traceback (most recent call last):
[1,4]<stderr>:  File "pretrain.py", line 619, in <module>
[1,4]<stderr>:    main(args)
[1,4]<stderr>:  File "pretrain.py", line 175, in main
[1,4]<stderr>:    train_loaders, val_loaders = build_target_loaders(target, t_r, opts)
[1,4]<stderr>:  File "pretrain.py", line 59, in build_target_loaders
[1,4]<stderr>:    target['vfeat_interval'], opts)
[1,4]<stderr>:  File "/src/load_data.py", line 37, in load_video_sub_dataset
[1,4]<stderr>:    if "msrvtt" in opts.tasks:
[1,4]<stderr>:AttributeError: 'Namespace' object has no attribute 'tasks'

So I printed opts:

[1,4]<stdout>:Namespace(betas=[0.9, 0.98], checkpoint='/pretrain/pretrain-tv-init.bin', compressed_db=False, drop_svmr_prob=0.8, dropout=0.1, fp16=True, grad_norm=1.0, gradient_accumulation_steps=2, hard_neg_weights=[10], hard_negtiave_start_step=[20000], hard_pool_size=[20], img_db='/video', learning_rate=3e-05, load_partial_pretrained=True, lr_mul=1.0, lw_neg_ctx=8.0, lw_neg_q=8.0, lw_st_ed=0.01, margin=0.1, mask_prob=0.15, max_clip_len=100, max_txt_len=60, model_config='config/hero_pretrain.json', n_gpu=6, n_workers=1, num_train_steps=1650000, optim='adamw', output_dir='pt-temp', pin_mem=True, ranking_loss_type='hinge', save_steps=500, seed=77, skip_layer_loading=True, sub_ctx_len=0, targets=[{'name': 'tv', 'sub_txt_db': 'tv_subtitles.db', 'vfeat_db': 'tv', 'vfeat_interval': 1.5, 'splits': [{'name': 'all', 'tasks': ['mlm', 'mfm-nce', 'fom', 'vsm'], 'train_idx': 'pretrain_splits/tv_train.json', 'val_idx': 'pretrain_splits/tv_val.json', 'ratio': [2, 2, 1, 2]}]}, {'name': 'ht100_full_filtered', 'sub_txt_db': 'howto100m_pretrain_all_60s_clip_sub.db', 'vfeat_db': 'howto100m_pretrain_all_60s_clips', 'vfeat_shards': ['howto100m_pretrain_all_clips_8', 'howto100m_pretrain_all_clips_0', 'howto100m_pretrain_all_clips_1', 'howto100m_pretrain_all_clips_2', 'howto100m_pretrain_all_clips_3', 'howto100m_pretrain_all_clips_4', 'howto100m_pretrain_all_clips_5', 'howto100m_pretrain_all_clips_6', 'howto100m_pretrain_all_clips_7', 'howto100m_pretrain_all_clips_9'], 'vfeat_interval': 2.0, 'splits': [{'name': 'all', 'tasks': ['mfm-nce', 'fom'], 'train_idx': ['howto100_full_pretrain_split/ht100_full_filtered_train_8.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_0.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_1.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_2.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_3.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_4.json', 
'howto100_full_pretrain_split/ht100_full_filtered_train_5.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_6.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_7.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_9.json'], 'val_idx': 'howto100_full_pretrain_split/ht100_full_filtered_val.json', 'ratio': [2, 1]}, {'name': 'has-sub', 'tasks': ['mlm', 'vsm'], 'train_idx': ['howto100_full_pretrain_split/ht100_full_filtered_train_8.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_0.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_1.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_2.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_3.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_4.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_5.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_6.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_7.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_9.json'], 'val_idx': 'howto100_full_pretrain_split/ht100_full_filtered_val.json', 'ratio': [2, 2]}]}], targets_ratio=[1, 9], train_batch_size=32, train_span_start_step=0, txt_db='/txt', use_all_neg=True, val_batch_size=32, valid_steps=5000, vfeat_interval=1.5, vfeat_version='resnet_slowfast', warmup_steps=10000, weight_decay=0.01)

I think sub_txt_db should already be a SubTokLmdb instance at this point, but it isn't. I'm not sure, though. How should I fix this?
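One way to see where `tasks` actually lives is to inspect the parsed options directly. The sketch below uses a trimmed copy of the Namespace printed above (only the `tv` target, most fields omitted). The helper name `list_split_tasks` is made up for illustration and is not part of HERO.

```python
from argparse import Namespace

# Trimmed copy of the printed options: only the fields needed here.
opts = Namespace(
    max_clip_len=100,
    targets=[
        {
            "name": "tv",
            "sub_txt_db": "tv_subtitles.db",
            "splits": [
                {"name": "all", "tasks": ["mlm", "mfm-nce", "fom", "vsm"]},
            ],
        },
    ],
)


def list_split_tasks(opts):
    """Hypothetical helper: map 'target/split' to that split's task list."""
    return {
        f"{target['name']}/{split['name']}": split["tasks"]
        for target in opts.targets
        for split in target["splits"]
    }


# Neither attribute exists at the top level, so opts.task / opts.tasks raises.
print(hasattr(opts, "task"), hasattr(opts, "tasks"))
# The task names only appear nested inside each target's splits.
print(list_split_tasks(opts))
```

In the pretraining config the task list appears per split rather than as a single top-level option, which seems consistent with the AttributeError above.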

HERO/load_data.py

Lines 36 to 40 in 00d8fbf

    if not isinstance(sub_txt_db, SubTokLmdb):
        if "msrvtt" in opts.task:
            sub_txt_db = VrSubTokLmdb(sub_txt_db, opts.max_clip_len)
        else:
            sub_txt_db = SubTokLmdb(sub_txt_db, opts.max_clip_len)

I can bypass this error if I skip L37-L39 and run only L40.
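A less invasive variant of that bypass might be to keep the MSRVTT branch but tolerate a missing attribute. This is only a sketch: `SubTokLmdb` and `VrSubTokLmdb` below are minimal stand-ins rather than the real HERO classes, `wrap_sub_txt_db` is a hypothetical name, and the attribute is assumed to be `task` as in the snippet above (the traceback mentions `tasks`).

```python
from argparse import Namespace


class SubTokLmdb:
    """Stand-in for HERO's subtitle LMDB wrapper (not the real class)."""

    def __init__(self, path, max_clip_len=-1):
        self.path = path
        self.max_clip_len = max_clip_len


class VrSubTokLmdb(SubTokLmdb):
    """Stand-in for the MSRVTT-specific variant (not the real class)."""


def wrap_sub_txt_db(sub_txt_db, opts):
    """Hypothetical helper mirroring the check in load_video_sub_dataset."""
    if isinstance(sub_txt_db, SubTokLmdb):
        return sub_txt_db
    # Pretraining options carry no top-level task, so fall back to "".
    task = getattr(opts, "task", "") or ""
    cls = VrSubTokLmdb if "msrvtt" in task else SubTokLmdb
    return cls(sub_txt_db, opts.max_clip_len)


pretrain_opts = Namespace(max_clip_len=100)  # no `task` attribute
db = wrap_sub_txt_db("tv_subtitles.db", pretrain_opts)
print(type(db).__name__)
```

With no `task` present, the fallback takes the same path as running only L40, while a downstream run that does set a task containing "msrvtt" would still get the MSRVTT variant.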

Here is another possible issue. This line:

f"{target['vfeat_db']}/{shard}", sub_txt_db,

Should it be changed to f"{opts.img_db}/{target['vfeat_db']}/{shard}", sub_txt_db, ?
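To make the proposed change concrete, here is what the modified f-string would produce for values taken from the printed options (`img_db='/video'` and the first two HowTo100M shards). The helper `shard_paths` is hypothetical, and whether the features really sit under this path depends on how the data is mounted locally.

```python
from argparse import Namespace

# Values copied from the printed Namespace; other fields omitted.
opts = Namespace(img_db="/video")
target = {
    "vfeat_db": "howto100m_pretrain_all_60s_clips",
    "vfeat_shards": [
        "howto100m_pretrain_all_clips_8",
        "howto100m_pretrain_all_clips_0",
    ],
}


def shard_paths(opts, target):
    """Hypothetical helper: build one feature path per shard."""
    return [
        f"{opts.img_db}/{target['vfeat_db']}/{shard}"
        for shard in target["vfeat_shards"]
    ]


for path in shard_paths(opts, target):
    print(path)
```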

@liveseongho Thanks for pointing out the issue and sorry about the inconvenience.

I will push a fix soon.

Let me know if you run into other issues. We have not had a chance to test the full pre-training since the code refactor.

@linjieli222

HERO/utils/save.py

Lines 21 to 23 in 1174d6a

def save_training_meta(args):
    if args.rank > 0:
        return

I got error messages as follows:

[1,0]<stderr>:  File "pretrain.py", line 618, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "pretrain.py", line 243, in main
[1,0]<stderr>:    save_training_meta(opts)
[1,0]<stderr>:  File "/src/utils/save.py", line 23, in save_training_meta
[1,0]<stderr>:    if args.rank > 0:
[1,0]<stderr>:AttributeError: 'Namespace' object has no attribute 'rank'

I printed args:

Namespace(betas=[0.9, 0.98], checkpoint='/pretrain/pretrain-tv-init.bin', compressed_db=False, drop_svmr_prob=0.8, dropout=0.1, fp16=True, grad_norm=1.0, gradient_accumulation_steps=2, hard_neg_weights=[10], hard_negtiave_start_step=[20000], hard_pool_size=[20], img_db='/video', learning_rate=3e-05, load_partial_pretrained=True, lr_mul=1.0, lw_neg_ctx=8.0, lw_neg_q=8.0, lw_st_ed=0.01, margin=0.1, mask_prob=0.15, max_clip_len=100, max_txt_len=60, model_config='config/hero_pretrain.json', n_gpu=8, n_workers=1, num_train_steps=1650000, optim='adamw', output_dir='pt-temp', pin_mem=True, ranking_loss_type='hinge', save_steps=500, seed=77, skip_layer_loading=True, sub_ctx_len=0, targets=[{'name': 'tv', 'sub_txt_db': 'tv_subtitles.db', 'vfeat_db': 'tv', 'vfeat_interval': 1.5, 'splits': [{'name': 'all', 'tasks': ['mlm', 'mfm-nce', 'fom', 'vsm'], 'train_idx': 'pretrain_splits/tv_train.json', 'val_idx': 'pretrain_splits/tv_val.json', 'ratio': [2, 2, 1, 2]}]}, {'name': 'ht100_full_filtered', 'sub_txt_db': 'howto100m_pretrain_all_60s_clip_sub.db', 'vfeat_db': 'howto100m_pretrain_all_60s_clips', 'vfeat_shards': ['howto100m_pretrain_all_clips_8', 'howto100m_pretrain_all_clips_0', 'howto100m_pretrain_all_clips_1', 'howto100m_pretrain_all_clips_2', 'howto100m_pretrain_all_clips_3', 'howto100m_pretrain_all_clips_4', 'howto100m_pretrain_all_clips_5', 'howto100m_pretrain_all_clips_6', 'howto100m_pretrain_all_clips_7', 'howto100m_pretrain_all_clips_9'], 'vfeat_interval': 2.0, 'splits': [{'name': 'all', 'tasks': ['mfm-nce', 'fom'], 'train_idx': ['howto100_full_pretrain_split/ht100_full_filtered_train_8.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_0.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_1.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_2.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_3.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_4.json', 
'howto100_full_pretrain_split/ht100_full_filtered_train_5.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_6.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_7.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_9.json'], 'val_idx': 'howto100_full_pretrain_split/ht100_full_filtered_val.json', 'ratio': [2, 1]}, {'name': 'has-sub', 'tasks': ['mlm', 'vsm'], 'train_idx': ['howto100_full_pretrain_split/ht100_full_filtered_train_8.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_0.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_1.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_2.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_3.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_4.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_5.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_6.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_7.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_9.json'], 'val_idx': 'howto100_full_pretrain_split/ht100_full_filtered_val.json', 'ratio': [2, 2]}]}], targets_ratio=[1, 9], train_batch_size=32, train_span_start_step=0, txt_db='/txt', use_all_neg=True, val_batch_size=32, valid_steps=5000, vfeat_interval=1.5, vfeat_version='resnet_slowfast', warmup_steps=10000, weight_decay=0.01)

How should I fix this issue?

@liveseongho

I have updated utils/save.py to comment out L23-24.

Please check if it works now.
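For anyone who would rather keep the rank check than comment it out, one possible alternative is to treat a missing `rank` as rank 0. This is a sketch only: the `save_training_meta` below is a stand-in that just reports whether it would proceed, not the real function from utils/save.py.

```python
from argparse import Namespace


def save_training_meta(args):
    """Stand-in for the real function: returns True if it would save."""
    # Assumption: when `rank` was never set, behave like rank 0.
    if getattr(args, "rank", 0) > 0:
        return False
    # The real function would write its training metadata here.
    return True


print(save_training_meta(Namespace(output_dir="pt-temp")))  # no rank set
print(save_training_meta(Namespace(output_dir="pt-temp", rank=3)))
```

The caveat is that in a multi-process run where `rank` is never assigned, every process would take the rank-0 path and try to write the metadata, so this default is only safe if `rank` is set on the options before the call.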

It works.

Thanks!