The COCO training command and GPU explosion issue when doing inference after training
Closed · 1 comment
Hi Mengyao,
Thank you so much for releasing the code on GitHub! I am trying to replicate your COCO results, but the README only provides the command for VOC.
Could you share the exact command you used to run the COCO experiments? I am currently using
SEED=2022 QUERY_UNIT=box INIT_NUM=50000 ADD_NUM=20000 TRAIN_STEP=10 GPUS=4 bash dist_run_compas.sh coco box_compas configs/mining/faster_rcnn/augs/faster_rcnn_r50_caffe_fpn_1x_coco_partial.py --deterministic
to run the experiment, but I found that the maximum iteration count generated by the code is 176000, not the 88000 reported in the paper, so I suspect I have not configured it correctly.
Also, how long does one training cycle take on the COCO dataset?
I also ran into a GPU memory explosion while running the following command:
if [ ${i} != `expr ${TRAIN_STEP} - 1` ]; then
    ckpt="$(ls ${workdir}/*.pth | sort -V | tail -n1)"
    echo "ckpt: $ckpt"
    python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
        $(dirname "$0")/tools/test_compas.py $CONFIG ${ckpt} --seed=$seed --work-dir=${workdir} \
        --cfg-options data.pool.ann_file=${workdir}/mixed.json \
        --out ${workdir}/results.pkl --format-only --gpu-collect --active-cycle ${i} --launcher pytorch
fi
after finishing the initial cycle of training (the first round of Compas on the COCO dataset). GPU memory kept increasing during inference and finally ran out after processing more than 90000 of the COCO images.
I noticed that in your multi_gpu_test function in acdet/apis/test.py, the results predicted by the model are not moved to the CPU. How do you avoid the GPU memory explosion?
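For context, the usual way to keep GPU memory flat in such a test loop is to detach each prediction and move it to host memory right after the forward pass. This is a minimal sketch of that pattern, not the repo's actual fix; the helper name `detach_to_cpu` is mine:

```python
import torch


def detach_to_cpu(result):
    """Recursively detach tensors and move them to host memory.

    If per-image predictions stay on the GPU, memory grows linearly with
    the number of images processed; moving each result to the CPU after
    the forward pass keeps GPU usage constant across the dataset.
    """
    if isinstance(result, torch.Tensor):
        return result.detach().cpu()
    if isinstance(result, (list, tuple)):
        return type(result)(detach_to_cpu(r) for r in result)
    if isinstance(result, dict):
        return {k: detach_to_cpu(v) for k, v in result.items()}
    return result
```

In a `multi_gpu_test`-style loop this would wrap the model output, e.g. `results.append(detach_to_cpu(model(return_loss=False, **data)))`.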
Here are some logging for your reference:
WORK_DIR: work_dirs
DATASET: coco
METHOD: box_compas
CONFIG: configs/mining/faster_rcnn/augs/faster_rcnn_r50_caffe_fpn_1x_coco_partial_debug.py
BACKBONE: faster
TRAINING_SETTING: partial
GPUS: 4
IMGS_PER_GPU: 4
PORT: 29504
TRAIN_STEP: 10
QUERY_UNIT: box
INIT_NUM: 50000
ADD_NUM: 20000
DELETE_MODEL: 0
seed: 2022
START_ITER: 0
SUFFIX: --count-box
branch_name: master
short_sha: 841c0d01b6b
epochs:
TEMPLATE: coco_faster_box_50000_20000_template_2022_master_841c0d01b6b
TIMESTAMP: coco_faster_box_50000_20000_partial_box_compas_20230930183509_2022_master_841c0d01b6b
train_ini: true
workdir: work_dirs/coco_faster_box_50000_20000_partial_box_compas_20230930183509_2022_master_841c0d01b6b/step_0
2023-09-30 19:08:20,528 - acdet - INFO - Distributed training: True
2023-09-30 19:08:20,528 - acdet - INFO - Set random seed to 2022, deterministic: True
2023-09-30 19:08:20,528 - acdet - INFO - ********* Mining for labeled and unlabeled dataset *********
2023-09-30 19:08:20,529 - acdet - INFO - rank: 0
2023-09-30 19:08:20,529 - acdet - INFO - cfg.ratio: 0
2023-09-30 19:08:20,529 - acdet - INFO - cfg.mining_method.startswith('box_'): True
2023-09-30 19:08:20,529 - acdet - INFO - Randomly mine 50000 unlabeled data
2023-09-30 19:08:33,458 - acdet - INFO - 6844 images are selected as labeled.
2023-09-30 19:08:33,458 - acdet - INFO - 110422 images are selected as unlabeled.
2023-09-30 19:08:33,458 - acdet - INFO - 0 images are used as labeled in previous step.
2023-09-30 19:08:33,458 - acdet - INFO - 117266 images are used as unlabeled in previous step.
Hi wuyujack, thank you for your interest and sorry for the late reply.
- As mentioned in the paper, the 88000 iterations were run with a batch size of 16. The default configuration sets samples_per_gpu=2, so with 4 GPUs the effective batch size is 8 and the required number of iterations doubles to 176000. You can either use 8 GPUs or modify the configuration to suit your setup.
- For the COCO dataset, we fine-tune the previous checkpoint for 0.3x iterations with dist_run_compas_resume.sh to trade off time consumption against performance. After the initial cycle, training takes about 6.5 h per cycle on 8 V100 GPUs. I have updated the COCO shell scripts to serve as a reference config for the above two issues.
- The memory issue has been fixed in the latest commit. Thank you for the reminder.
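The iteration scaling can be double-checked with a quick back-of-the-envelope calculation (the helper below is illustrative, not code from the repo; it assumes the total number of images processed is held fixed):

```python
def required_iters(base_iters, base_batch, gpus, samples_per_gpu):
    """Scale the iteration count so the total number of images seen stays fixed.

    The effective batch size is gpus * samples_per_gpu; fewer GPUs at the
    same samples_per_gpu means proportionally more iterations.
    """
    effective_batch = gpus * samples_per_gpu
    return base_iters * base_batch // effective_batch


# Paper setting: 88000 iters at batch size 16 (e.g. 8 GPUs x 2 samples/GPU).
print(required_iters(88000, 16, 4, 2))  # 4 GPUs -> effective batch 8 -> 176000
print(required_iters(88000, 16, 8, 2))  # 8 GPUs -> effective batch 16 -> 88000
```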
Feel free to reopen this if you encounter any new issues.