lyumengyao/blad

The COCO training command and GPU memory explosion issue when doing inference after training

Closed this issue · 1 comment

Hi Mengyao,

Thank you so much for releasing the code on GitHub! I am trying to replicate your results on COCO, but the README only provides the command for VOC.

Could you give me the exact command you have used to run the COCO experiments? I am trying to use

SEED=2022 QUERY_UNIT=box INIT_NUM=50000 ADD_NUM=20000 TRAIN_STEP=10 GPUS=4 bash dist_run_compas.sh coco box_compas configs/mining/faster_rcnn/augs/faster_rcnn_r50_caffe_fpn_1x_coco_partial.py --deterministic

to run the experiment, but I found that the max iteration computed in the code is 176000 rather than the 88000 you report in the paper, so I suspect I have not set the configuration correctly.

By the way, how long does one cycle of training take on the COCO dataset?

I also encountered the GPU memory explosion issue when I was running the following command:

  if [ ${i} != `expr ${TRAIN_STEP} - 1` ]; then
    ckpt="$(ls  ${workdir}/*.pth | sort -V | tail -n1)"
    echo "ckpt: $ckpt"
    python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
        $(dirname "$0")/tools/test_compas.py $CONFIG ${ckpt} --seed=$seed --work-dir=${workdir} \
        --cfg-options data.pool.ann_file=${workdir}/mixed.json \
        --out ${workdir}/results.pkl --format-only --gpu-collect --active-cycle ${i} --launcher pytorch
  fi

after finishing the initial cycle of training in the first round of Compas on the COCO dataset. GPU memory kept increasing during inference, and it finally ran out of memory after processing more than 90000 COCO images.

I noticed that in your multi_gpu_test function in acdet/apis/test.py, the results predicted by the model are not moved to the CPU. May I ask how you avoid the GPU memory explosion issue?
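
To illustrate what I mean, here is a rough sketch of the kind of change I have in mind (this is not your actual code; the loop and the model call follow the usual mmdetection test loop, and the helper name is made up):

    import torch

    def multi_gpu_test_cpu_safe(model, data_loader):
        # Illustrative only: keep per-batch predictions on the CPU so memory
        # does not accumulate on the GPU over the whole unlabeled pool.
        model.eval()
        results = []
        for data in data_loader:
            with torch.no_grad():
                result = model(return_loss=False, rescale=True, **data)
            # Detach and move any CUDA tensors in the per-image results to the
            # CPU before keeping a reference to them.
            result = [
                r.detach().cpu() if isinstance(r, torch.Tensor) else r
                for r in result
            ]
            results.extend(result)
        return results

With the per-batch results kept on the CPU, only the current batch should occupy GPU memory during inference.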

Here are some logs for your reference:

WORK_DIR: work_dirs 
DATASET: coco 
METHOD: box_compas 
CONFIG: configs/mining/faster_rcnn/augs/faster_rcnn_r50_caffe_fpn_1x_coco_partial_debug.py 
BACKBONE: faster 
TRAINING_SETTING: partial 

GPUS: 4 
IMGS_PER_GPU: 4 
PORT: 29504 
TRAIN_STEP: 10 
QUERY_UNIT: box 
INIT_NUM: 50000 
ADD_NUM: 20000 
DELETE_MODEL: 0 
seed: 2022 

START_ITER: 0 
SUFFIX: --count-box  
branch_name: master 
short_sha: 841c0d01b6b 

epochs:  

TEMPLATE: coco_faster_box_50000_20000_template_2022_master_841c0d01b6b 
TIMESTAMP: coco_faster_box_50000_20000_partial_box_compas_20230930183509_2022_master_841c0d01b6b 
train_ini: true 

workdir: work_dirs/coco_faster_box_50000_20000_partial_box_compas_20230930183509_2022_master_841c0d01b6b/step_0 

2023-09-30 19:08:20,528 - acdet - INFO - Distributed training: True 
2023-09-30 19:08:20,528 - acdet - INFO - Set random seed to 2022, deterministic: True 
2023-09-30 19:08:20,528 - acdet - INFO - ********* Mining for labeled and unlabeled dataset ********* 
2023-09-30 19:08:20,529 - acdet - INFO - rank: 0 
2023-09-30 19:08:20,529 - acdet - INFO - cfg.ratio: 0 
2023-09-30 19:08:20,529 - acdet - INFO - cfg.mining_method.startswith('box_'): True 
2023-09-30 19:08:20,529 - acdet - INFO - Randomly mine 50000 unlabeled data 
2023-09-30 19:08:33,458 - acdet - INFO - 6844 images are selected as labeled. 
2023-09-30 19:08:33,458 - acdet - INFO - 110422 images are selected as unlabeled. 
2023-09-30 19:08:33,458 - acdet - INFO - 0 images are used as labeled in previous step. 
2023-09-30 19:08:33,458 - acdet - INFO - 117266 images are used as unlabeled in previous step. 

Hi wuyujack, thank you for your interest and sorry for the late reply.

  1. As mentioned in the paper, 88000 iterations were run with a batch size of 16. The default configuration sets samples_per_gpu=2, so with 4 GPUs the effective batch size is 8 and the required number of iterations is doubled. You could use 8 GPUs or modify the configuration to keep the effective batch size at 16 (see the sketch after this list).
  2. For the COCO dataset, we finetune the previous checkpoint for 0.3x the iterations to trade off time consumption against performance, using dist_run_compas_resume.sh. After the initial cycle, training takes about 6.5 h per cycle on 8 V100 GPUs.
    I have updated the shell scripts for COCO to serve as a reference configuration for the above two issues.
  3. The GPU memory issue during inference has been improved in the latest commit. Thank you for the reminder.
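
For example, to keep the effective batch size at 16 with 4 GPUs, you could raise samples_per_gpu in the data config. This is only a minimal sketch assuming the standard mmdetection data fields; the partial config in this repo may set these values elsewhere:

    # Sketch of a data-config override, not a verbatim copy of the repo's config:
    # 4 GPUs x 4 samples_per_gpu = effective batch size 16, matching the
    # 88000-iteration schedule in the paper (2 samples_per_gpu x 8 GPUs).
    data = dict(
        samples_per_gpu=4,   # default is 2
        workers_per_gpu=4,
    )

If the training script exposes mmdetection's --cfg-options flag, the same field can usually also be overridden from the command line, e.g. --cfg-options data.samples_per_gpu=4.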

Feel free to reopen this if you encounter any new issues.