Jiyao06/GenPose

about train time

Closed this issue · 23 comments

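Thanks to the authors for this work. I am trying to reproduce the results. When I run train_score.sh on a 4090, the estimated time is 7 minutes per epoch, but the actual time is far longer. The GPU is only used intermittently; most of the time GPU utilization is 0, so training takes very long. What could be causing this?
Training starts normally:
[screenshot]
Then it suddenly stalls, with GPU utilization at 0:
[screenshot]
Then it moves again:
[screenshot]

It goes on like this: the progress bar advances a step, stalls for a few minutes, then advances again. Very strange.
[screenshot]

At this rate I estimate that one epoch takes more than an hour, nothing like the under-10-minutes the author mentioned.

[screenshot]
Only 17% is done after 30 minutes.

[screenshot]
Using print statements I found:
[screenshot]
For example, right now at 13% progress, after printing the train score time, it suddenly hangs and does not enter the next iteration of the for loop. Is this a bug?

[screenshot]
I tried timing the dataloader and found that the data is read serially. Most of the time is spent reading samples one by one, which is what makes it slow. So does training really take this long?
[screenshot]
It keeps running the whole time; after a few minutes, once a batch of data has been read, training starts.

[screenshot]
It looks like 16 CPU workers are used for data loading.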
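The diagnosis above (time spent waiting for data versus time spent in the training step) can be reproduced with a small timing wrapper. This is a minimal sketch, not code from the GenPose repository: `profile_loop`, `slow_batches`, and the toy step function are hypothetical names, and for a real GPU step you would also need to synchronize the device inside `step_fn` for the compute time to be meaningful.

```python
import time


def profile_loop(batches, step_fn, max_steps=20):
    """Split wall time into 'waiting for data' and 'compute' over a few steps."""
    fetch_s, compute_s, steps = 0.0, 0.0, 0
    it = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)            # time spent here is the training step
        t2 = time.perf_counter()
        fetch_s += t1 - t0
        compute_s += t2 - t1
        steps += 1
    return {"steps": steps, "fetch_s": fetch_s, "compute_s": compute_s}


def slow_batches(n, delay=0.01):
    """Toy stand-in for a slow data loader."""
    for i in range(n):
        time.sleep(delay)
        yield i


stats = profile_loop(slow_batches(5), step_fn=lambda batch: None)
print(stats)
```

If `fetch_s` dominates, the bottleneck is on the data side (disk, decoding, number of workers) rather than on the GPU.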

On a 3090 I do get about 12 minutes per epoch, and I have not seen long stretches with GPU utilization at 0.

The slow training speed is not due to GPU but rather the slow data loading speed. For the machine you are using, you can adjust the num_workers based on the machine's performance. Additionally, if you still want to increase the training speed after the data loads normally, I suggest you try the following strategies.

1. Reducing total epochs, increasing the value of repeat_num in config: The advantage here is that after loading the data once, it allows for multiple random samples of different T and perturbed pose to obtain more accurate gradients and update the network once.

2. Reducing total epochs, multiple updates per load: Upon each data load, perform multiple random samples of different T and perturbed pose and update the network multiple times.

3. Data preprocessing: While our dataset structure is designed to align with previous studies for a consistent and fair comparison, optimizing data format can be effective. By preprocessing the data—cropping, and sampling object point clouds and then saving them in a .npy format for subsequent training—you can significantly reduce the time spent on data loading.
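Strategies 1 and 2 can be sketched as a generic loop. This is only an illustration under assumptions: `train_epoch`, `compute_loss`, `apply_update`, and `updates_per_load` are hypothetical names, `compute_loss` is assumed to draw a fresh random T and perturbed pose on every call, and averaging the losses is one plausible reading of how repeat_num yields a more accurate gradient, not a description of the repository's implementation.

```python
import random


def train_epoch(batches, compute_loss, apply_update, repeat_num=1, updates_per_load=1):
    """Reuse each loaded batch for several noise samples and/or several updates."""
    n_updates = 0
    for batch in batches:
        for _ in range(updates_per_load):   # strategy 2: several updates per load
            # strategy 1: average repeat_num loss samples, then update once
            loss = sum(compute_loss(batch) for _ in range(repeat_num)) / repeat_num
            apply_update(loss)
            n_updates += 1
    return n_updates


# Toy stand-ins that only count how often they are called.
calls = {"loss": 0, "update": 0}


def toy_loss(batch):
    calls["loss"] += 1
    return batch * random.random()


def toy_update(loss):
    calls["update"] += 1


n = train_epoch([1.0, 2.0, 3.0], toy_loss, toy_update, repeat_num=4, updates_per_load=2)
print(n, calls)
```

With three loaded batches, the toy run performs six updates from 24 loss evaluations, so the cost of each data load is amortized over more gradient work.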

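Thank you very much for the suggestions. I tried them. First I moved the data from a hard disk drive to an SSD; the time to fetch one batch (192) dropped from 26 to 2.6 (almost 10x). Then I moved the processing done in get_item offline: I generate each object's point cloud ahead of time and save it as .npy. My dataset now takes 7% of the original disk space, and a single get_item call is nearly 10x faster:
[screenshot]
Original get_item time:
[screenshot]
After optimization:
[screenshot]
In the end the total time per epoch dropped from one hour to 20 minutes (bs=384; I have two 4090s and wanted to saturate the GPUs, so I doubled the batch size). That is still not the under-10-minutes the author mentioned. When I set bs back to 192, the time actually rose to 30 minutes per epoch, mainly because the number of iterations went from 900 to about 1700.
The main problem now is that the time spent on the GPU during training is too long:
[screenshot]
So is there any room for optimization in the GPU part of training? Also, at roughly what loss value does training converge? In the first epoch my loss dropped from 9 to 1, but after that it decreases very slowly.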
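The numbers quoted above can be turned into per-iteration cost and throughput. This is a quick back-of-the-envelope calculation, assuming both measurements were taken on the same two-GPU setup and using the approximate iteration counts from the post (900 and about 1700).

```python
def seconds_per_iteration(epoch_minutes, iterations):
    """Average wall time of one training iteration."""
    return epoch_minutes * 60.0 / iterations


sec_bs384 = seconds_per_iteration(20, 900)    # bs=384, 20 min per epoch
sec_bs192 = seconds_per_iteration(30, 1700)   # bs=192, 30 min per epoch

print(f"bs=384: {sec_bs384:.2f} s/iter, {384 / sec_bs384:.0f} samples/s")
print(f"bs=192: {sec_bs192:.2f} s/iter, {192 / sec_bs192:.0f} samples/s")
```

Halving the batch size shortens each iteration only slightly (about 1.33 s to about 1.06 s) while nearly doubling the iteration count, which suggests a large per-iteration cost that does not scale with batch size.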

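Separately, I have a question about the depth-to-point-cloud conversion in get_item. While checking the data I found:
[screenshot]
Point clouds generated on the CAMERA dataset look normal, but those generated on the Real dataset contain outlier noise points, so I suspect the algorithm may be the cause. The relevant original code is here:
[screenshot]
I do not fully understand how the point cloud is generated from depth here. The bbox seems to be randomly shifted and scaled; could that be letting some noise in?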
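For reference, the depth-to-point-cloud step is the standard pinhole back-projection, applied to every pixel that has valid depth and lies inside the mask. The sketch below works on a single pixel; `backproject` is a hypothetical helper, and the intrinsics are the CAMERA values listed in the preprocessing script later in this thread.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z to camera coordinates."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z


# CAMERA intrinsics as given in the script in this thread
fx = fy = 577.5
cx, cy = 319.5, 239.5

on_axis = backproject(cx, cy, 1000.0, fx, fy, cx, cy)     # principal point lies on the optical axis
off_axis = backproject(400, 300, 800.0, fx, fy, cx, cy)   # right of and below the principal point
print(on_axis, off_axis)
```

In the script in this thread, the pixel coordinate map is cropped with the same affine warp as the depth and mask, so each cropped pixel keeps its original image coordinates. On that reading, the random shift and scale change which pixels end up in the crop rather than the 3D position assigned to them; the mask deformation step (`defor_2D`) may be a more likely place to look for background pixels leaking in, though that is only a guess.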

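[screenshot]
It turned out to be a dual-GPU issue. With a single 4090, one epoch takes a little over 8 minutes (bs=192), but with two GPUs it takes 20 minutes per epoch (bs=384). Which should I choose? Confused!

With the same bs=192, two GPUs are slower than one:
[screenshot]
What is the reason?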

Hello, have you trained the score network (train_score)? How low did your loss get? Mine stops decreasing at 0.35, and my results are still 10 points short of the author's.

@fuzhao123232 How did you convert it to npy file? Could you possibly share the code you used? Thank you

(just changed from dataset.get_item)

import sys
import os
import open3d as o3d
import cv2

import numpy as np
import _pickle as cPickle

sys.path.insert(0, '../')
import shutil

from utils.data_augmentation import defor_2D, get_rotation
from utils.datasets_utils import aug_bbox_DZI, get_2d_coord_np, crop_resize_by_warp_affine
from utils.sgpa_utils import load_depth, get_bbox



def get_sym_info(c, mug_handle=1):
    # sym_info: c0 is the face classification; c1, c2, c3 are the three view symmetries, corresponding to xy, xz, yz respectively
    # c0: 0 no symmetry, 1 axis symmetry, 2 two reflection planes, 3 unimplemented type
    # Y axis points upwards, x axis passes through the handle, z axis otherwise
    #
    # for the specific definition, see sketch_loss
    if c == 'bottle':
        sym = np.array([1, 1, 0, 1], dtype=np.int8)
    elif c == 'bowl':
        sym = np.array([1, 1, 0, 1], dtype=np.int8)
    elif c == 'camera':
        sym = np.array([0, 0, 0, 0], dtype=np.int8)
    elif c == 'can':
        sym = np.array([1, 1, 1, 1], dtype=np.int8)
    elif c == 'laptop':
        sym = np.array([0, 1, 0, 0], dtype=np.int8)
    elif c == 'mug' and mug_handle == 1:
        sym = np.array([0, 1, 0, 0], dtype=np.int8)  # for mug, we currently mark it as no symmetry
    elif c == 'mug' and mug_handle == 0:
        sym = np.array([1, 0, 0, 0], dtype=np.int8)
    else:
        sym = np.array([0, 0, 0, 0], dtype=np.int8)
    return sym
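

def _sample_points(pcl, n_pts):
    """Resample the point cloud to exactly n_pts points.

    Clouds with fewer points are padded by repetition; clouds with more points
    are randomly subsampled (this is not farthest point sampling).

    Args:
        pcl (numpy array):  NumPoints x 3
        n_pts (int): target point number
    """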
    total_pts_num = pcl.shape[0]
    if total_pts_num < n_pts:
        pcl = np.concatenate([np.tile(pcl, (n_pts // total_pts_num, 1)), pcl[:n_pts % total_pts_num]], axis=0)
    elif total_pts_num > n_pts:
        ids = np.random.permutation(total_pts_num)[:n_pts]
        pcl = pcl[ids]
    return pcl


def _depth_to_pcl(depth, K, xymap, mask):
    K = K.reshape(-1)
    cx, cy, fx, fy = K[2], K[5], K[0], K[4]
    depth = depth.reshape(-1).astype(np.float32)
    valid = ((depth > 0) * mask.reshape(-1)) > 0
    depth = depth[valid]
    x_map = xymap[0].reshape(-1)[valid]
    y_map = xymap[1].reshape(-1)[valid]
    real_x = (x_map - cx) * depth / fx
    real_y = (y_map - cy) * depth / fy
    pcl = np.stack((real_x, real_y, depth), axis=-1)
    return pcl.astype(np.float32)


def get_fs_net_scale(c, model, nocs_scale):
    # model pc x 3
    lx = max(model[:, 0]) - min(model[:, 0])
    ly = max(model[:, 1]) - min(model[:, 1])
    lz = max(model[:, 2]) - min(model[:, 2])

    # real scale
    lx_t = lx * nocs_scale * 1000
    ly_t = ly * nocs_scale * 1000
    lz_t = lz * nocs_scale * 1000

    if c == 'bottle':
        unitx = 87
        unity = 220
        unitz = 89
    elif c == 'bowl':
        unitx = 165
        unity = 80
        unitz = 165
    elif c == 'camera':
        unitx = 88
        unity = 128
        unitz = 156
    elif c == 'can':
        unitx = 68
        unity = 146
        unitz = 72
    elif c == 'laptop':
        unitx = 346
        unity = 200
        unitz = 335
    elif c == 'mug':
        unitx = 146
        unity = 83
        unitz = 114
    elif c == '02876657':
        unitx = 324 / 4
        unity = 874 / 4
        unitz = 321 / 4
    elif c == '02880940':
        unitx = 675 / 4
        unity = 271 / 4
        unitz = 675 / 4
    elif c == '02942699':
        unitx = 464 / 4
        unity = 487 / 4
        unitz = 702 / 4
    elif c == '02946921':
        unitx = 450 / 4
        unity = 753 / 4
        unitz = 460 / 4
    elif c == '03642806':
        unitx = 581 / 4
        unity = 445 / 4
        unitz = 672 / 4
    elif c == '03797390':
        unitx = 670 / 4
        unity = 540 / 4
        unitz = 497 / 4
    else:
        unitx = 0
        unity = 0
        unitz = 0
        print('This category is not recorded in my little brain.')
        raise NotImplementedError
    # scale residual
    return np.array([lx_t - unitx, ly_t - unity, lz_t - unitz]), np.array([unitx, unity, unitz])


source = 'CAMERA'
vis = True  # draw_geometries blocks until the window is closed; set to False for batch conversion
data_dir = '../data'
in_rpth = '/media/px_dataset1/fuzhao_datasets/CATEGORY_DATASETS/CAMERA/train/'
save_rpth = '/home/fuzhao/datasets/category_level_datasets/CAMERA/train_npy/'
# model_file_path = ['obj_models/camera_train.pkl', 'obj_models/real_train.pkl','obj_models/camera_val.pkl', 'obj_models/real_test.pkl']
model_file_path = ['obj_models/camera_train.pkl']
id2cat_name_real = {'1': 'bottle', '2': 'bowl', '3': 'camera', '4': 'can', '5': 'laptop', '6': 'mug'}
id2cat_name_CAMERA = {'1': '02876657','2': '02880940', '3': '02942699','4': '02946921', '5': '03642806', '6': '03797390'}
camera_intrinsics = np.array([[577.5, 0, 319.5], [0, 577.5, 239.5], [0, 0, 1]],dtype=np.float32)
real_intrinsics = np.array([[591.0125, 0, 322.525], [0, 590.16775, 244.11084], [0, 0, 1]], dtype=np.float32)
img_size = 256  # cropped image size
dynamic_zoom_in_params = {'DZI_PAD_SCALE': 1.5, 'DZI_TYPE': 'uniform', 'DZI_SCALE_RATIO': 0.25,
                          'DZI_SHIFT_RATIO': 0.25}
deform_2d_params = {'roi_mask_r': 3, 'roi_mask_pro': 0.5}
if source == 'CAMERA':
    id2cat_name = id2cat_name_CAMERA
    out_camK = camera_intrinsics
elif source == 'real':
    id2cat_name = id2cat_name_real
    out_camK = real_intrinsics

models = {}
for path in model_file_path:
    with open(os.path.join(data_dir, path), 'rb') as f:
        models.update(cPickle.load(f))

for i in range(5000):  # number of scenes to convert; set this to the size of your dataset
    scene = in_rpth + '%05d' % i
    scene_id = scene.split("/")[-1]
    save_scene_rpth = save_rpth + f'{scene_id}/'
    os.makedirs(save_scene_rpth, exist_ok=True)
    for i in range(10):   # loop over the 10 images of the scene (reuses i; scene is already set above)
        # rgb_pth = scene + f"/000{i}_color.png"
        dep_pth = scene + f"/000{i}_depth.png"
        mask_pth = scene + f"/000{i}_mask.png"
        label_pth = scene + f"/000{i}_label.pkl"
        save_scene_gt_pth = save_scene_rpth + f"000{i}_label.pkl"
        print("-----------------:", dep_pth)
        # rgb = cv2.imread(rgb_pth)
        mask = cv2.imread(mask_pth)
        mask = mask[:, :, 2]
        depth = load_depth(dep_pth)
        im_H, im_W = mask.shape[0], mask.shape[1]
        if os.path.exists(label_pth):
            with open(label_pth, 'rb') as f:
                gts = cPickle.load(f)
            shutil.copy(label_pth, save_scene_gt_pth)
            for j in range(len(gts['instance_ids'])):  # loop over the objects in this image and save a point cloud for each
                # generate and save the point cloud; index j matches the entry in gts
                # idx = random.randint(0, len(gts['instance_ids']) - 1)
                idx = j   # the original code picks one random object; here we loop over all of them
                inst_id = gts['instance_ids'][idx]
                coord_2d = get_2d_coord_np(640, 480).transpose(1, 2, 0)
                rmin, rmax, cmin, cmax = get_bbox(gts['bboxes'][idx])
                bbox_xyxy = np.array([cmin, rmin, cmax, rmax])
                bbox_center, scale = aug_bbox_DZI(dynamic_zoom_in_params, bbox_xyxy, im_H, im_W)  # randomly shift the center and scale the longer side
                # roi_coord_2d ----------------------------------------------------
                roi_coord_2d = crop_resize_by_warp_affine(
                    coord_2d, bbox_center, scale, img_size, interpolation=cv2.INTER_NEAREST
                ).transpose(2, 0, 1)
                mask_target = mask.copy().astype(np.float32)
                mask_target[mask != inst_id] = 0.0
                mask_target[mask == inst_id] = 1.0
                roi_mask = crop_resize_by_warp_affine(
                    mask_target, bbox_center, scale, img_size, interpolation=cv2.INTER_NEAREST
                )
                roi_mask = np.expand_dims(roi_mask, axis=0)
                roi_depth = crop_resize_by_warp_affine(
                    depth, bbox_center, scale, img_size, interpolation=cv2.INTER_NEAREST
                )
                roi_depth = np.expand_dims(roi_depth, axis=0)
                # normalize depth
                depth_valid = roi_depth > 0
                if np.sum(depth_valid) <= 1.0:
                    print("--------error: too few valid depth pixels")
                    continue
                roi_m_d_valid = roi_mask.astype(np.bool_) * depth_valid
                if np.sum(roi_m_d_valid) <= 1.0:
                    print("--------error: too few valid depth pixels inside the mask")
                    continue

                roi_mask_def = defor_2D(
                    roi_mask,
                    rand_r=deform_2d_params['roi_mask_r'],
                    rand_pro=deform_2d_params['roi_mask_pro']
                )
                pcl_in = _depth_to_pcl(roi_depth, out_camK, roi_coord_2d, roi_mask_def) / 1000.0

                if len(pcl_in) < 50:
                    print("-------error: fewer than 50 points")
                    continue  # skip this instance; _sample_points cannot pad an empty cloud
                pcl_in = _sample_points(pcl_in, 1024)

                # vis
                if vis:
                    pcd = o3d.geometry.PointCloud()
                    pcd.points = o3d.utility.Vector3dVector(pcl_in)
                    o3d.visualization.draw_geometries([pcd])
                # save
                save_pts_pth = save_scene_rpth + f'000{i}_{j}_pts.npy'  # point cloud of object j in image i
                np.save(save_pts_pth, pcl_in)

@fuzhao123232 Thank you very much for sharing your code! I see that you saved the point cloud of every instance. How did you modify the datasets_genpose.py code accordingly?

`def __getitem__(self, index):
    # t0 = time.time()
    # print(f"----------------------------------------------------------")
    img_path = os.path.join(self.data_dir, self.img_list[index])
    if img_path in self.invaild_list:
        return self.__getitem__((index + 1) % self.__len__())
    try:
        with open(img_path + '_label.pkl', 'rb') as f:
            gts = cPickle.load(f)
    except Exception:
        # label file missing or unreadable: fall through to the next sample
        return self.__getitem__((index + 1) % self.__len__())
    if 'CAMERA' in img_path.split('/'):
        out_camK = self.camera_intrinsics
        img_type = 'syn'
    else:
        out_camK = self.real_intrinsics
        img_type = 'real'

    # select one foreground object,
    # if specified, then select the object
    if self.per_obj != '':
        idx = gts['class_ids'].index(self.per_obj_id)
    else:
        idx = random.randint(0, len(gts['instance_ids']) - 1)  # index of the instance within this image
        ''' 
        ############### remove selected categories ###############
        remove_ids = self.cat_name2id['bowl']
        idx = None
        for i in range(10):
            idx_i = random.randint(0, len(gts['instance_ids']) - 1)
            if gts['class_ids'][idx_i] != remove_ids:
                idx = idx_i
                break
        if idx is None:
            return self.__getitem__((index + 1) % self.__len__())
        ##########################################################
        '''
    if gts['class_ids'][idx] == 6 and img_type == 'real':
        if self.mode == 'train':
            handle_tmp_path = img_path.split('/')
            scene_label = handle_tmp_path[-2] + '_res'
            img_id = int(handle_tmp_path[-1])
            mug_handle = self.mug_sym[scene_label][img_id]
        else:
            mug_handle = gts['handle_visibility'][idx]
    else:
        mug_handle = 1
    # t1 = time.time()
    # print("load gts time:",t1 - t0)
    """load points"""
    # Reuse the idx selected above so that per_obj and mug_handle match the loaded instance.
    pts_pth = img_path + f'_{idx}_pts.npy'
    if os.path.exists(pts_pth):
        pcl_in = np.load(pts_pth)
    else:
        return self.__getitem__((index + 1) % self.__len__())
    # t2 = time.time()
    # print("load pts time:",t2 - t1)
    # cat_id, rotation translation and scale
    cat_id = gts['class_ids'][idx] - 1  # convert to 0-indexed
    # note that this is nocs model, normalized along diagonal axis
    model_name = gts['model_list'][idx]
    model = self.models[gts['model_list'][idx]].astype(np.float32)  # 1024 points
    nocs_scale = gts['scales'][idx]  # nocs_scale = image file / model file
    # fsnet scale (from model) scale residual
    fsnet_scale, mean_shape = self.get_fs_net_scale(self.id2cat_name[str(cat_id + 1)], model, nocs_scale)
    fsnet_scale = fsnet_scale / 1000.0
    mean_shape = mean_shape / 1000.0
    rotation = gts['rotations'][idx]
    translation = gts['translations'][idx]
    # sym
    sym_info = self.get_sym_info(self.id2cat_name[str(cat_id + 1)], mug_handle=mug_handle)
    # generate augmentation parameters
    bb_aug, rt_aug_t, rt_aug_R = self.generate_aug_parameters()
    # t3 = time.time()
    # print("process time :",t3 - t2)
    # vis data
    # print("image_pth", img_path)
    # pcd = o3d.geometry.PointCloud()
    # pcd.points = o3d.utility.Vector3dVector(pcl_in)
    # o3d.visualization.draw_geometries([pcd])

    data_dict = {}
    data_dict['pcl_in'] = torch.as_tensor(pcl_in.astype(np.float32)).contiguous()
    data_dict['cat_id'] = torch.as_tensor(cat_id, dtype=torch.int8).contiguous()
    data_dict['rotation'] = torch.as_tensor(rotation, dtype=torch.float32).contiguous()
    data_dict['translation'] = torch.as_tensor(translation, dtype=torch.float32).contiguous()
    data_dict['fsnet_scale'] = torch.as_tensor(fsnet_scale, dtype=torch.float32).contiguous()
    data_dict['sym_info'] = torch.as_tensor(sym_info.astype(np.float32)).contiguous()
    data_dict['mean_shape'] = torch.as_tensor(mean_shape, dtype=torch.float32).contiguous()
    data_dict['aug_bb'] = torch.as_tensor(bb_aug, dtype=torch.float32).contiguous()
    data_dict['aug_rt_t'] = torch.as_tensor(rt_aug_t, dtype=torch.float32).contiguous()
    data_dict['aug_rt_R'] = torch.as_tensor(rt_aug_R, dtype=torch.float32).contiguous()
    data_dict['model_point'] = torch.as_tensor(model, dtype=torch.float32).contiguous()
    data_dict['nocs_scale'] = torch.as_tensor(nocs_scale, dtype=torch.float32).contiguous()
    data_dict['handle_visibility'] = torch.as_tensor(int(mug_handle), dtype=torch.int8).contiguous()
    data_dict['path'] = img_path
    # print(f"total time:",time.time() - t0)
    return data_dict `

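@fuzhao123232 Hello, I used your code above to save the point clouds as npy files, but I do not know how to use them. How do you train with the saved npy files, and how should the file paths be handled?
Also, your code uses for i in range(5000):
Should 5000 be changed to 27500?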

Of course you need to change it to the size of your dataset. To use the files, rewrite the __getitem__ method in datasets_genpose.py; I posted the code above.
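One way to avoid hard-coding the scene count is to enumerate the scene folders on disk. This is a minimal sketch assuming one sub-directory per scene, as in the paths used by the script above; `list_scene_dirs` is a hypothetical helper and the temporary directory only stands in for the real dataset root.

```python
import os
import tempfile


def list_scene_dirs(root):
    """Return the sorted names of the sub-directories under root (one per scene)."""
    return sorted(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))


# Toy directory tree standing in for the real train folder
with tempfile.TemporaryDirectory() as root:
    for k in range(3):
        os.mkdir(os.path.join(root, "%05d" % k))
    scenes = list_scene_dirs(root)

print(len(scenes), scenes)
```

The outer loop of the conversion script could then iterate over the returned names instead of `range(5000)`.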

Thank you for the pointers! After switching to your rewritten __getitem__ method, should I change img_path to the path of the saved npy files?

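You need to be clear about what the input is: the point cloud and its corresponding ground truth. Nothing else is needed; rgb, depth, and mask are no longer required.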
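Going by the two snippets posted in this thread, each sample needs just two files that share the same prefix. The sketch below spells out that naming; `sample_paths` is a hypothetical helper and the example prefix is made up, so treat it as one possible reading of the layout rather than the definitive setup.

```python
def sample_paths(img_path, idx):
    """Files read per sample by the rewritten __getitem__ shown in this thread."""
    return {
        "label": img_path + "_label.pkl",        # ground truth copied by the conversion script
        "points": img_path + f"_{idx}_pts.npy",  # point cloud of instance idx
    }


# Hypothetical prefix: <dataset root>/<scene id>/<image id>
paths = sample_paths("CAMERA/train_npy/00000/0003", idx=2)
print(paths)
```

Under this reading, `img_path` has to resolve to the folder that holds both the copied `_label.pkl` files and the `_pts.npy` files, and it still needs a `CAMERA` path component for the synthetic/real check in `__getitem__` to work.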

OK, thank you for the guidance! ^^

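@fuzhao123232 Hello, I saw that you mentioned training with two GPUs. How did you set up multi-GPU training? When I tried to train with three GPUs, I set CUDA_VISIBLE_DEVICES=0,1,2 in the terminal and also tried adding nn.DataParallel in the py file, but neither had any effect. I then changed scripts/train_score.sh to use three GPUs. The GPU being used did change, but at most one GPU is ever occupied; whichever GPU is used, the other two stay idle.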

@fuzhao123232 Hello, how long did you train this score_net in the end, and how low did the final loss get?