OFA-Sys/OFA

How to train VQA on my custom data?

xiaoqiang-lu opened this issue · 11 comments

Hello! I am trying to fine-tune OFA-large on VQA with a custom dataset, following the fine-tuning instructions in the repo. I have checked my .tsv and .pkl files several times and they match the samples you provided. But after running "bash train_vqa_distributed.sh", the terminal only prints:

total_num_updates 40000
warmup_updates 1000
lr 5e-5
patch_image_size 480

The GPU usage rises to a certain value, then suddenly drops back to zero, and the program ends. I train on a single server with 2 GPUs. Looking forward to your reply, and thanks for sharing your work!

Hi, could you please provide the exact script you run on your machine and the type of your GPU cards? I will check on my environment.

Moreover, for fine-tuning on custom VQA-formatted data, please also refer to this recent issue for more information: #76.

Thanks for your reply! At first I was using two 3080 Ti cards; I have now replaced them with four V100 cards, but the same problem occurs. The script on my machine:

GPUS_PER_NODE=4
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8214
export RANK=0

The rest is unchanged. I also made my own ans2label.pkl file.
Here is a part of my .tsv file without imgbase64:
(screenshot of the .tsv file)
Here is a part of my .pkl file:
(screenshot of the .pkl file)
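For reference, the ans2label.pkl file is simply a pickled dict mapping each candidate answer string to an integer label. A minimal sketch of building one, with a hypothetical answer list:

```python
import pickle

# Hypothetical candidate answers collected from the training set.
answers = ["yes", "no", "red", "2"]

# Map each answer string to an integer label index.
ans2label = {ans: idx for idx, ans in enumerate(answers)}

# Save in the pickle format the training script expects to load.
with open("ans2label.pkl", "wb") as f:
    pickle.dump(ans2label, f)
```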

Hi, have you checked the path of $log_file defined in your training script? The running log is saved in that file rather than printed to stdout. The program may have ended for other reasons, which may be recorded in the log. Please share more information once you find this log file.

Thanks! It seems the problem is caused by my images; I am using the code you replied with in issue #56 for imgbase64.
(screenshot)
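For what it's worth, the usual way to fill the image base64 field is to encode the raw image bytes; a minimal stdlib-only sketch (the exact snippet in #56 may differ, e.g. by re-encoding the image through PIL first):

```python
import base64

def image_to_base64(path):
    # Read the raw image bytes and return them as a UTF-8 base64 string,
    # suitable for the image field of the TSV file.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```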

I have solved the above problem, but another problem occurs:
(screenshot of the error)

Hi, please check whether the fields of the input data line that caused this error correspond to the specified selected_cols. By default, selected_cols is set to 0,5,2,3,4 in the script, which sequentially fetches the 0th (uniq_id), 5th (image), 2nd (question), 3rd (answer info), and 4th (predict_objects) fields from each input TSV line. If any field mismatches, errors may occur.

I have checked the input data line, and it is the same as the example. I printed column_l and its length; column_l is correct: [img_id, imgbase64, question, answer, objects].
(screenshot)

Hi, I think there is a misunderstanding of how each data line is organized. As mentioned in the README, in each line of the TSV file the fields follow the exact order question-id, image-id, question, answer (with confidence), predicted object labels, and image base64 string, so there are 6 fields in total (and the image-id field is not used). By specifying selected_cols=0,5,2,3,4, the program sequentially fetches the 0th (question-id), 5th (image), 2nd (question), 3rd (answer info), and 4th (predict_objects) fields from each input TSV line, resulting in a sample to be further processed in the __getitem__ method of VqaGenDataset.
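Put concretely, selected_cols just indexes the tab-separated fields of each line. A sketch with made-up placeholder values (the question-id, answer, and object strings here are hypothetical):

```python
# One TSV line with all 6 fields, in the order described above:
# question-id, image-id, question, answer info, predicted objects, image base64.
line = "79459\tCOCO_1\tis this person wearing shorts?\t1.0|!+no\tshorts,person\t<base64>"
fields = line.rstrip("\n").split("\t")

# Parse the selected_cols spec and pick fields in that order.
selected_cols = [int(c) for c in "0,5,2,3,4".split(",")]
uniq_id, image, question, refs, objects = (fields[i] for i in selected_cols)
```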

By the way, when preparing the dataset TSV file, I would also recommend splitting an original training sample that has more than one golden answer into multiple samples, each containing only one of the answers. This takes full advantage of the ground-truth supervision of the training samples; otherwise, only the golden answer with the highest confidence score is used as supervision.


How did you resolve this problem? I'm having the same problem. Thanks!