microsoft/UniVL

"Weights from pretrained model not used in UniVL" during evaluation: visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are missing from visual-base, cross-base, and decoder-base

lokeaichirou opened this issue · 12 comments

When I run evaluation with the pre-trained weights, I get the following messages:

  • INFO - Weight doesn't exsits. /content/visual-base/visual_pytorch_model.bin
    ......
  • INFO - Weight doesn't exsits. /content/cross-base/cross_pytorch_model.bin
    ......
  • INFO - Weight doesn't exsits. /content/decoder-base/decoder_pytorch_model.bin
  • WARNING - Stage-One:True, Stage-Two:False
  • WARNING - Set bert_config.num_hidden_layers: 12.
  • WARNING - Set visual_config.num_hidden_layers: 6.
  • INFO - --------------------
  • INFO - Weights from pretrained model not used in UniVL:

eval_epoch() never actually runs; the program simply finishes.

The files visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are missing from the visual-base, cross-base, and decoder-base folders on the GitHub page.

@lokeaichirou Sorry for the confusion. The INFO message "Weight doesn't exsits" is not useful and can be ignored. The files visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are not shipped with the repository; all pretrained weights are contained in univl.pretrained.bin. The missing files do not affect execution or results.

As for eval_epoch() not running before the program finishes: are there any errors? Could you provide the full log of your run?
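One way to check that a single checkpoint covers the sub-modules named in the log is to group its parameter names by their first dotted component. The sketch below does only the grouping, on hypothetical key names; the prefixes bert, visual, cross, and decoder are assumptions made for illustration and are not confirmed by this thread.

```python
from collections import Counter


def count_by_prefix(keys):
    """Count parameter names by their first dotted component."""
    return Counter(key.split(".", 1)[0] for key in keys)


# Hypothetical parameter names, for illustration only.
# A real checkpoint supplies its own keys.
sample_keys = [
    "bert.embeddings.word_embeddings.weight",
    "visual.encoder.layer.0.attention.self.query.weight",
    "cross.encoder.layer.0.attention.self.query.weight",
    "decoder.classifier.cls.predictions.bias",
]

counts = count_by_prefix(sample_keys)
for prefix, n in sorted(counts.items()):
    print(prefix, n)
```

With PyTorch installed, and assuming the file holds a plain state dict, the keys could come from torch.load("univl.pretrained.bin", map_location="cpu") instead of sample_keys.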

Hi, no errors are reported during evaluation. For the action arguments, I used parser.set_defaults(do_pretrain=False, do_train=False, do_eval=True) to run evaluation only, based on the pre-trained weights. I have attached my log.txt below.

Hi, here is my log: log.txt

I also tried setting the stage-two action argument to True. Evaluation then reaches the 'for batch in test_dataloader' loop, but fails there with 'RuntimeError: DataLoader worker (pid(s) 1643) exited unexpectedly':

RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
985 try:
--> 986 data = self._data_queue.get(timeout=timeout)
987 return (True, data)

11 frames
/usr/lib/python3.7/multiprocessing/queues.py in get(self, block, timeout)
103 timeout = deadline - time.monotonic()
--> 104 if not self._poll(timeout):
105 raise Empty

/usr/lib/python3.7/multiprocessing/connection.py in poll(self, timeout)
256 self._check_readable()
--> 257 return self._poll(timeout)
258

/usr/lib/python3.7/multiprocessing/connection.py in _poll(self, timeout)
413 def _poll(self, timeout):
--> 414 r = wait([self], timeout)
415 return bool(r)

/usr/lib/python3.7/multiprocessing/connection.py in wait(object_list, timeout)
920 while True:
--> 921 ready = selector.select(timeout)
922 if ready:

/usr/lib/python3.7/selectors.py in select(self, timeout)
414 try:
--> 415 fd_event_list = self._selector.poll(timeout)
416 except InterruptedError:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
65 # Python can still get and update the process status successfully.
---> 66 _error_if_any_worker_fails()
67 if previous_handler is not None:

RuntimeError: DataLoader worker (pid 1643) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
in ()
69 if args.local_rank == 0:
70 print('DO EVALUATION')
---> 71 Bleu_4 = eval_epoch(args, model, test_dataloader, tokenizer, device, n_gpu, nlgEvalObj=nlgEvalObj)
72 print('EVALUATION ENDS')

in eval_epoch(args, model, test_dataloader, tokenizer, device, n_gpu, nlgEvalObj, test_set)
14
15 print('START EVALUATION!\n')
---> 16 for batch in test_dataloader:
17 batch = tuple(t.to(device, non_blocking=True) for t in batch)
18

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
1180
1181 assert not self._shutdown and self._tasks_outstanding > 0
-> 1182 idx, data = self._get_data()
1183 self._tasks_outstanding -= 1
1184 if self._dataset_kind == _DatasetKind.Iterable:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _get_data(self)
1146 else:
1147 while True:
-> 1148 success, data = self._try_get_data()
1149 if success:
1150 return data

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
997 if len(failed_workers) > 0:
998 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 999 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
1000 if isinstance(e, queue.Empty):
1001 return (False, None)

RuntimeError: DataLoader worker (pid(s) 1643) exited unexpectedly

@lokeaichirou It seems that something is wrong with multiprocessing in the DataLoader. Can you try the command below?
The difference is that --do_train --num_thread_reader=4 is replaced with --do_eval --num_thread_reader=0. --num_thread_reader sets the number of DataLoader worker subprocesses. In addition, --stage_two should be set for captioning.

python -m torch.distributed.launch --nproc_per_node=4 main_task_caption.py --do_eval --num_thread_reader=0 --epochs=5 --batch_size=16 --n_display=100 --train_csv ${TRAIN_CSV} --val_csv ${VAL_CSV} --data_path ${DATA_PATH} --features_path ${FEATURES_PATH} --output_dir ${OUTPUT_ROOT}/ckpt_youcook_caption --bert_model bert-base-uncased --do_lower_case --lr 3e-5 --max_words 128 --max_frames 96 --batch_size_val 64 --visual_num_hidden_layers 6 --decoder_num_hidden_layers 3 --stage_two --init_model ${INIT_MODEL}
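To make the role of these flags concrete, here is a minimal argparse sketch of how --num_thread_reader could end up as the DataLoader's num_workers, as in the num_workers=args.num_thread_reader line quoted later in this thread. The parser definitions and the default value are assumptions for illustration, not the repository's actual code.

```python
import argparse

# Hypothetical parser that mirrors only the flags discussed in this thread.
parser = argparse.ArgumentParser()
parser.add_argument("--do_eval", action="store_true")
parser.add_argument("--stage_two", action="store_true")
parser.add_argument("--num_thread_reader", type=int, default=1)  # default is assumed
parser.add_argument("--batch_size_val", type=int, default=64)

args = parser.parse_args(
    ["--do_eval", "--stage_two", "--num_thread_reader=0", "--batch_size_val", "64"]
)

# These keyword arguments would be passed to torch.utils.data.DataLoader.
# With num_workers=0, batches are loaded in the main process, so no worker
# subprocess exists that could be killed.
loader_kwargs = {
    "batch_size": args.batch_size_val,
    "num_workers": args.num_thread_reader,
    "pin_memory": False,
}
print(loader_kwargs)
```

Loading in the main process is usually slower than using workers, but it removes the multiprocessing layer that failed in the traceback above.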

@lokeaichirou Evaluation does take up a lot of GPU memory. You can reduce the batch size, e.g., --batch_size_val 64 -> --batch_size_val 8, or reduce the sequence lengths --max_words 128 and --max_frames 96, to find a trade-off.
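As a rough guide to that trade-off, the number of self-attention scores in one Transformer layer grows linearly with batch size and quadratically with sequence length. The sketch below compares settings on that basis only; it is a back-of-the-envelope proxy, not a measurement of UniVL, and it ignores model weights, the number of heads, and decoding buffers.

```python
def relative_attention_cost(batch_size, seq_len):
    """Proportional to the self-attention score count: batch * seq_len^2."""
    return batch_size * seq_len * seq_len


baseline = relative_attention_cost(batch_size=64, seq_len=96)

# Cutting --batch_size_val from 64 to 8 shrinks this term linearly.
smaller_batch = relative_attention_cost(batch_size=8, seq_len=96)
print(baseline / smaller_batch)  # 8.0

# Halving the sequence length shrinks it quadratically.
shorter_seq = relative_attention_cost(batch_size=64, seq_len=48)
print(baseline / shorter_seq)  # 4.0
```

Reducing the batch size leaves the model's inputs unchanged, whereas shortening --max_words or --max_frames truncates them, so lowering the batch size is the safer first step.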

Hi @ArrowLuo, I followed your suggestion and set num_thread_reader=0. Evaluation with the provided pre-trained weights now works. Many thanks! I will try training later; can it also start from the pre-trained weights? I will follow up here if I run into any issues during training. Thanks again!

Yes, I reduced them and it finally works. Thanks!

Hi @ArrowLuo, may I ask another basic question? youcookii_data.no_transcript.pickle, youcookii_val.csv, and youcookii_videos_features.pickle are all fed into the test dataloader. The captioning results table in the paper lists the input as V only, T only, or V+T. Which input form is used with the default arguments and the following dataloader_youcook setup?

youcook_testset = Youcook_Caption_DataLoader(
    csv=args.val_csv,
    data_path=args.data_path,
    features_path=args.features_path,
    max_words=args.max_words,
    feature_framerate=args.feature_framerate,
    tokenizer=tokenizer,
    max_frames=args.max_frames,
)
test_sampler = SequentialSampler(youcook_testset)
dataloader_youcook = DataLoader(
    youcook_testset,
    sampler=test_sampler,
    batch_size=args.batch_size_val,
    num_workers=args.num_thread_reader,
    pin_memory=False,
)

Because youcookii_data.no_transcript.pickle has no transcript (it is replaced by 'none'), the input form is V only. We select V-only or T-only input by masking T or V, respectively, so V only, T only, and V+T all share the same DataLoader.
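The masking idea can be sketched with plain 0/1 masks, where 0 hides a position from the model. The function name and mode strings below are hypothetical, and this illustrates the general technique rather than UniVL's actual implementation.

```python
def modality_masks(mode, num_words, num_frames):
    """Return (text_mask, video_mask) for mode 'V', 'T', or 'V+T'."""
    if mode not in {"V", "T", "V+T"}:
        raise ValueError(f"unknown mode: {mode!r}")
    use_text = mode in {"T", "V+T"}
    use_video = mode in {"V", "V+T"}
    # A 1 keeps a position visible to the model; a 0 masks it out.
    text_mask = [1 if use_text else 0] * num_words
    video_mask = [1 if use_video else 0] * num_frames
    return text_mask, video_mask


for mode in ("V", "T", "V+T"):
    print(mode, modality_masks(mode, num_words=3, num_frames=2))
```

Because only the masks change, the same dataset and DataLoader can serve all three settings.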

ok, I see. Many thanks!

Hello, may I ask how you downloaded the original videos of the youcookii dataset? Thanks!