about train time

Question

Closed this issue 8 months ago · 23 comments

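Thanks to the authors for this work. I am trying to reproduce the results. When I run train_score.sh on a 4090, the estimated time is 7 minutes per epoch, but the actual time is far longer. The GPU is used only intermittently; most of the time GPU utilization is 0, so training takes very long. What could be causing this?
Training starts normally:

Then it suddenly stalls, with GPU utilization at 0:

Then it moves again:

It goes on like this: the progress bar advances one step, stalls for a few minutes, then advances again. Very strange.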

Answer 1 · 2024-01-23T08:23:37.000Z

At this rate I estimate one epoch will take more than an hour, which is nowhere near the under-10-minutes the author reports.

Answer 2 · 2024-01-23T08:24:58.000Z

After 30 minutes, training has only reached 17%.

Answer 3 · 2024-01-23T08:39:24.000Z

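Using print statements, I found the following:

For example, right now at 13% progress, after the train score time is printed, it suddenly hangs and does not enter the next iteration of the for loop. Is this a bug, or something else?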

Answer 4 · 2024-01-23T09:08:59.000Z

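I timed the dataloader and found that the data is read serially. Most of the time is spent reading samples one by one, which is why it is so slow. So does training really need this much time?

It is running the whole time; after a few minutes, once a full batch of data has been read, training starts.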
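One way to confirm this kind of diagnosis is to time the wait for each batch separately from the training step. The sketch below is a minimal, framework-free illustration: `profile_loop`, `slow_batches`, and the sleep durations are hypothetical stand-ins, not code from the repository.

```python
import time


def profile_loop(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    wait = compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent here is data-loading time
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # time spent here is compute time
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_batches(n, delay):
    """Hypothetical loader that takes `delay` seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


wait, compute = profile_loop(slow_batches(3, 0.05), lambda b: time.sleep(0.01))
print(f"data wait {wait:.2f}s, compute {compute:.2f}s")
```

If the wait time dominates, the loop is bound by data loading rather than by the GPU. With a real GPU training step, CUDA work runs asynchronously, so the compute figure is only meaningful if the step synchronizes (for example with `torch.cuda.synchronize()`) before the timer is read.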

Answer 5 · 2024-01-23T09:28:48.000Z

It should be using 16 CPUs for data loading.

Answer 6 · 2024-01-23T11:50:38.000Z

With a 3090 it is indeed about 12 minutes per epoch for me, and I have not seen long stretches with GPU utilization at 0.

Answer 7 · 2024-01-23T19:41:23.000Z

The slow training speed is not due to GPU but rather the slow data loading speed. For the machine you are using, you can adjust the num_workers based on the machine's performance. Additionally, if you still want to increase the training speed after the data loads normally, I suggest you try the following strategies.

1. Reducing total epochs, increasing the value of repeat_num in config: The advantage here is that after loading the data once, it allows for multiple random samples of different T and perturbed pose to obtain more accurate gradients and update the network once.

2. Reducing total epochs, multiple updates per load: Upon each data load, perform multiple random samples of different T and perturbed pose and update the network multiple times.

3. Data preprocessing: While our dataset structure is designed to align with previous studies for a consistent and fair comparison, optimizing data format can be effective. By preprocessing the data—cropping, and sampling object point clouds and then saving them in a .npy format for subsequent training—you can significantly reduce the time spent on data loading.
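The difference between strategies 1 and 2 is only in how often the network is updated per data load. The toy sketch below shows the two control flows on a scalar least-squares problem. It is not the GenPose training loop: only `repeat_num` comes from the reply above, and every other name (`noisy_grad`, `step_averaged`, `step_multiple`, the Gaussian noise standing in for sampling T and a perturbed pose) is a hypothetical illustration.

```python
import random


def noisy_grad(w, batch, rng):
    """Gradient of mean((w - (x + noise))**2) for one random perturbation."""
    noise = rng.gauss(0.0, 0.1)  # stand-in for sampling T and a perturbed pose
    return sum(2.0 * (w - (x + noise)) for x in batch) / len(batch)


def step_averaged(w, batch, repeat_num, lr, rng):
    """Strategy 1: average repeat_num sampled gradients, then update once."""
    g = sum(noisy_grad(w, batch, rng) for _ in range(repeat_num)) / repeat_num
    return w - lr * g


def step_multiple(w, batch, repeat_num, lr, rng):
    """Strategy 2: update once per sampled gradient, reusing the loaded batch."""
    for _ in range(repeat_num):
        w = w - lr * noisy_grad(w, batch, rng)
    return w


rng = random.Random(0)
batch = [1.0, 2.0, 3.0, 4.0]  # stand-in for one (slow) data load
w1 = step_averaged(0.0, batch, repeat_num=8, lr=0.1, rng=rng)
w2 = step_multiple(0.0, batch, repeat_num=8, lr=0.1, rng=rng)
print(w1, w2)
```

In both cases the batch is loaded once; strategy 1 spends the extra samples on a less noisy single step, while strategy 2 spends them on more steps, so `w2` moves further toward the batch mean than `w1`.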

Answer 8 · 2024-01-26T07:02:23.000Z

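Many thanks to the author for the suggestions. I tried them. First I moved the data from a hard disk to an SSD, and the time to fetch one batch (192) dropped from 26 to 2.6 (almost 10x). Then I moved the processing inside get_item offline, generating each object's point cloud and saving it as npy. As a result my data now takes only 7% of the original disk space, and a single get_item read is nearly 10x faster:

Original get_item time:

After optimization:

In the end, the total training time per epoch dropped from one hour to 20 minutes (bs=384; I have two 4090s and wanted to keep the GPUs full, so I doubled the batch size). That is still short of the under-10-minutes the author reports. When I set bs back to 192, an epoch took 30 minutes instead, mainly because the number of iterations rose from 900 to about 1700.
The main problem now is that the GPU time during training is too long.

So I wonder whether there is room to optimize the training on the GPU side? Also, at roughly what loss value does training converge? In the first epoch my loss dropped from 9 to 1, but after that it decreases very slowly.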

Answer 9 · 2024-01-26T07:07:15.000Z

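I also have some questions about the depth-map-to-point-cloud conversion in get_item. While checking the data I found the following:

The point clouds generated from the CAMERA dataset look normal, but those generated from the Real dataset contain outlier noise points, so I suspect this may be caused by the algorithm? The relevant original code is here:

I do not fully understand how the point cloud is generated from depth here. It seems the bbox is randomly shifted and scaled; could that be what lets some noise in?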
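For reference, the conversion in question is a pinhole back-projection: each masked pixel (u, v) with depth z maps to x = (u - cx) * z / fx and y = (v - cy) * z / fy. The sketch below works through it for single pixels, using the Real intrinsics quoted in the preprocessing script later in this thread. The pixel coordinates and depth values are made-up illustrations, and the sketch does not establish what causes the outliers; it only shows that any masked pixel carrying a background depth ends up far from the object.

```python
def backproject(u, v, depth_mm, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in mm to metres."""
    z = depth_mm / 1000.0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z


# Real-dataset intrinsics as quoted in this thread
fx, fy = 591.0125, 590.16775
cx, cy = 322.525, 244.11084

# Hypothetical pixels: one on the object, and a neighbour whose depth
# reading belongs to the background behind the object.
on_object = backproject(400, 300, 800.0, fx, fy, cx, cy)
on_background = backproject(401, 300, 2500.0, fx, fy, cx, cy)
print(on_object, on_background)
```

The two pixels are adjacent in the image, yet their back-projected points are about 1.7 m apart in depth, which is how a few stray mask pixels can show up as outliers in the cloud.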

Answer 10 · 2024-01-26T07:57:55.000Z

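It turned out to be a dual-GPU problem. With a single 4090, one epoch takes a little over 8 minutes (bs=192), but with two GPUs it takes 20 minutes per epoch (bs=384). Which should I choose? Confused!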

Answer 11 · 2024-01-26T08:01:52.000Z

With the same bs=192, two GPUs are slower than one.

What is the reason?
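The epoch times reported above are easier to compare per iteration. The sketch below does that arithmetic. It assumes the 30-minute run at bs=192 used both GPUs, that the single-GPU run also took about 1700 iterations, and that "a little over 8 minutes" is 8.5 minutes; none of these is stated exactly in the thread.

```python
def sec_per_iter(minutes_per_epoch, iters_per_epoch):
    """Average wall-clock seconds per training iteration."""
    return minutes_per_epoch * 60.0 / iters_per_epoch


dual_384 = sec_per_iter(20, 900)        # two GPUs, bs=384  -> about 1.33 s
dual_192 = sec_per_iter(30, 1700)       # two GPUs, bs=192  -> about 1.06 s
single_192 = sec_per_iter(8.5, 1700)    # one GPU,  bs=192  -> about 0.30 s (8.5 min assumed)

print(f"{dual_384:.2f} {dual_192:.2f} {single_192:.2f}")
```

Under these assumptions each dual-GPU iteration at bs=192 costs roughly 3.5x the single-GPU iteration, so the extra time is per-step overhead rather than a larger number of steps.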

Answer 12 · 2024-02-02T02:42:45.000Z

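> With a 3090 it is indeed about 12 minutes per epoch for me, and I have not seen long stretches with GPU utilization at 0.

Hello, have you trained the score network (train_score)? How low does your loss get? Mine drops to 0.35 and then stops decreasing, and I am still 10 points short of the author's results.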

Answer 13 · 2024-02-03T03:39:03.000Z

@fuzhao123232 How did you convert it to npy file? Could you possibly share the code you used? Thank you

Answer 14 · 2024-02-06T06:10:11.000Z

> @fuzhao123232 How did you convert it to npy file? Could you possibly share the code you used? Thank you

(just changed from dataset.get_item)

import sys
import os
import open3d as o3d
import cv2

import numpy as np
import _pickle as cPickle

sys.path.insert(0, '../')
import shutil

from utils.data_augmentation import defor_2D, get_rotation
from utils.datasets_utils import aug_bbox_DZI, get_2d_coord_np, crop_resize_by_warp_affine
from utils.sgpa_utils import load_depth, get_bbox



def get_sym_info( c, mug_handle=1):
    #  sym_info  c0: face classification  c1, c2, c3: three-view symmetry, corresponding to xy, xz, yz respectively
    # c0: 0 no symmetry, 1 axis symmetry, 2 two reflection planes, 3 unimplemented type
    #  Y axis points upwards, x axis passes through the handle, z axis otherwise
    #
    # for the specific definition, see sketch_loss
    if c == 'bottle':
        sym = np.array([1, 1, 0, 1], dtype=np.int8)
    elif c == 'bowl':
        sym = np.array([1, 1, 0, 1], dtype=np.int8)
    elif c == 'camera':
        sym = np.array([0, 0, 0, 0], dtype=np.int8)
    elif c == 'can':
        sym = np.array([1, 1, 1, 1], dtype=np.int8)
    elif c == 'laptop':
        sym = np.array([0, 1, 0, 0], dtype=np.int8)
    elif c == 'mug' and mug_handle == 1:
        sym = np.array([0, 1, 0, 0], dtype=np.int8)  # for mug, we currently mark it as no symmetry
    elif c == 'mug' and mug_handle == 0:
        sym = np.array([1, 0, 0, 0], dtype=np.int8)
    else:
        sym = np.array([0, 0, 0, 0], dtype=np.int8)
    return sym


def _sample_points( pcl, n_pts):
    """ Resample the point cloud to n_pts points by random sampling.
    Points are tiled when there are fewer than n_pts (this is not farthest point sampling).

    Args:
        pcl (numpy array):  NumPoints x 3
        n_pts (int): target point number
    """
    total_pts_num = pcl.shape[0]
    if total_pts_num < n_pts:
        pcl = np.concatenate([np.tile(pcl, (n_pts // total_pts_num, 1)), pcl[:n_pts % total_pts_num]], axis=0)
    elif total_pts_num > n_pts:
        ids = np.random.permutation(total_pts_num)[:n_pts]
        pcl = pcl[ids]
    return pcl


def _depth_to_pcl( depth, K, xymap, mask):
    K = K.reshape(-1)
    cx, cy, fx, fy = K[2], K[5], K[0], K[4]
    depth = depth.reshape(-1).astype(np.float32)
    valid = ((depth > 0) * mask.reshape(-1)) > 0
    depth = depth[valid]
    x_map = xymap[0].reshape(-1)[valid]
    y_map = xymap[1].reshape(-1)[valid]
    real_x = (x_map - cx) * depth / fx
    real_y = (y_map - cy) * depth / fy
    pcl = np.stack((real_x, real_y, depth), axis=-1)
    return pcl.astype(np.float32)


def get_fs_net_scale( c, model, nocs_scale):
    # model pc x 3
    lx = max(model[:, 0]) - min(model[:, 0])
    ly = max(model[:, 1]) - min(model[:, 1])
    lz = max(model[:, 2]) - min(model[:, 2])

    # real scale
    lx_t = lx * nocs_scale * 1000
    ly_t = ly * nocs_scale * 1000
    lz_t = lz * nocs_scale * 1000

    if c == 'bottle':
        unitx = 87
        unity = 220
        unitz = 89
    elif c == 'bowl':
        unitx = 165
        unity = 80
        unitz = 165
    elif c == 'camera':
        unitx = 88
        unity = 128
        unitz = 156
    elif c == 'can':
        unitx = 68
        unity = 146
        unitz = 72
    elif c == 'laptop':
        unitx = 346
        unity = 200
        unitz = 335
    elif c == 'mug':
        unitx = 146
        unity = 83
        unitz = 114
    elif c == '02876657':
        unitx = 324 / 4
        unity = 874 / 4
        unitz = 321 / 4
    elif c == '02880940':
        unitx = 675 / 4
        unity = 271 / 4
        unitz = 675 / 4
    elif c == '02942699':
        unitx = 464 / 4
        unity = 487 / 4
        unitz = 702 / 4
    elif c == '02946921':
        unitx = 450 / 4
        unity = 753 / 4
        unitz = 460 / 4
    elif c == '03642806':
        unitx = 581 / 4
        unity = 445 / 4
        unitz = 672 / 4
    elif c == '03797390':
        unitx = 670 / 4
        unity = 540 / 4
        unitz = 497 / 4
    else:
        unitx = 0
        unity = 0
        unitz = 0
        print('This category is not recorded in my little brain.')
        raise NotImplementedError
    # scale residual
    return np.array([lx_t - unitx, ly_t - unity, lz_t - unitz]), np.array([unitx, unity, unitz])
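

source = 'CAMERA'  # 'CAMERA' or 'real'
vis = True  # set to False for batch conversion; draw_geometries blocks until the window is closed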
data_dir = '../data'
in_rpth = '/media/px_dataset1/fuzhao_datasets/CATEGORY_DATASETS/CAMERA/train/'
save_rpth = '/home/fuzhao/datasets/category_level_datasets/CAMERA/train_npy/'
# model_file_path = ['obj_models/camera_train.pkl', 'obj_models/real_train.pkl','obj_models/camera_val.pkl', 'obj_models/real_test.pkl']
model_file_path = ['obj_models/camera_train.pkl']
id2cat_name_real = {'1': 'bottle', '2': 'bowl', '3': 'camera', '4': 'can', '5': 'laptop', '6': 'mug'}
id2cat_name_CAMERA = {'1': '02876657','2': '02880940', '3': '02942699','4': '02946921', '5': '03642806', '6': '03797390'}
camera_intrinsics = np.array([[577.5, 0, 319.5], [0, 577.5, 239.5], [0, 0, 1]],dtype=np.float32)
real_intrinsics = np.array([[591.0125, 0, 322.525], [0, 590.16775, 244.11084], [0, 0, 1]], dtype=np.float32)
img_size = 256  # cropped image size
dynamic_zoom_in_params = {'DZI_PAD_SCALE': 1.5, 'DZI_TYPE': 'uniform', 'DZI_SCALE_RATIO': 0.25,
                          'DZI_SHIFT_RATIO': 0.25}
deform_2d_params = {'roi_mask_r': 3, 'roi_mask_pro': 0.5}
if source == 'CAMERA':
    id2cat_name = id2cat_name_CAMERA
    out_camK = camera_intrinsics
elif source == 'real':
    id2cat_name = id2cat_name_real
    out_camK = real_intrinsics

models = {}
for path in model_file_path:
    with open(os.path.join(data_dir, path), 'rb') as f:
        models.update(cPickle.load(f))

for scene_idx in range(5000):  # set 5000 to the number of scenes in your own data
    scene = in_rpth + '%05d' % scene_idx
    scene_id = scene.split("/")[-1]
    save_scene_rpth = save_rpth + f'{scene_id}/'
    os.makedirs(save_scene_rpth, exist_ok=True)
    for i in range(10):   # loop over the 10 images of the scene
        # rgb_pth = scene + f"/000{i}_color.png"
        dep_pth = scene + f"/000{i}_depth.png"
        mask_pth = scene + f"/000{i}_mask.png"
        label_pth = scene + f"/000{i}_label.pkl"
        save_scene_gt_pth = save_scene_rpth + f"000{i}_label.pkl"
        print("-----------------:", dep_pth)
        # rgb = cv2.imread(rgb_pth)
        mask = cv2.imread(mask_pth)
        mask = mask[:, :, 2]
        depth = load_depth(dep_pth)  # load_depth reads the file itself, so no separate cv2.imread is needed
        im_H, im_W = mask.shape[0], mask.shape[1]
        im_H, im_W = mask.shape[0], mask.shape[1]
        if os.path.exists(label_pth):
            with open(label_pth, 'rb') as f:
                gts = cPickle.load(f)
            shutil.copy(label_pth, save_scene_gt_pth)
            for j in range(len(gts['instance_ids'])):  # loop over the n objects in one image, generating and saving a point cloud for each
                # generate the point cloud, save it, and keep the correspondence with the gt
                # idx = random.randint(0, len(gts['instance_ids']) - 1)
                idx = j   # the original code picks one random object; here we loop over all of them
                inst_id = gts['instance_ids'][idx]
                coord_2d = get_2d_coord_np(640, 480).transpose(1, 2, 0)
                rmin, rmax, cmin, cmax = get_bbox(gts['bboxes'][idx])
                bbox_xyxy = np.array([cmin, rmin, cmax, rmax])
                bbox_center, scale = aug_bbox_DZI(dynamic_zoom_in_params, bbox_xyxy, im_H, im_W)  # randomly shift the center and scale the long side
                # roi_coord_2d ----------------------------------------------------
                roi_coord_2d = crop_resize_by_warp_affine(
                    coord_2d, bbox_center, scale, img_size, interpolation=cv2.INTER_NEAREST
                ).transpose(2, 0, 1)
                mask_target = mask.copy().astype(np.float32)
                mask_target[mask != inst_id] = 0.0
                mask_target[mask == inst_id] = 1.0
                roi_mask = crop_resize_by_warp_affine(
                    mask_target, bbox_center, scale, img_size, interpolation=cv2.INTER_NEAREST
                )
                roi_mask = np.expand_dims(roi_mask, axis=0)
                roi_depth = crop_resize_by_warp_affine(
                    depth, bbox_center, scale, img_size, interpolation=cv2.INTER_NEAREST
                )
                roi_depth = np.expand_dims(roi_depth, axis=0)
                # normalize depth
                depth_valid = roi_depth > 0
                if np.sum(depth_valid) <= 1.0:
                    print("--------error: too few valid depth pixels in the crop")
                    continue
                roi_m_d_valid = roi_mask.astype(np.bool_) * depth_valid
                if np.sum(roi_m_d_valid) <= 1.0:
                    print("--------error: too few valid depth pixels inside the mask")
                    continue

                roi_mask_def = defor_2D(
                    roi_mask,
                    rand_r=deform_2d_params['roi_mask_r'],
                    rand_pro=deform_2d_params['roi_mask_pro']
                )
                pcl_in = _depth_to_pcl(roi_depth, out_camK, roi_coord_2d, roi_mask_def) / 1000.0

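                if len(pcl_in) < 50:
                    print("-------error: fewer than 50 points in the point cloud")
                    continue  # skip this instance; an empty cloud would also crash _sample_points
                pcl_in = _sample_points(pcl_in, 1024)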

                # vis
                if vis:
                    pcd = o3d.geometry.PointCloud()
                    pcd.points = o3d.utility.Vector3dVector(pcl_in)
                    o3d.visualization.draw_geometries([pcd])
                # save
                save_pts_pth = save_scene_rpth + f'000{i}_{j}_pts.npy'  # point cloud of object j in image i
                np.save(save_pts_pth, pcl_in)

Answer 15 · 2024-02-18T06:16:43.000Z

@fuzhao123232 Thank you very much for sharing your code! I see that you have saved the point clouds for every instances, how did you modify the datasets_genpose.py code accordingly?

Answer 16 · 2024-02-21T01:59:58.000Z

> @fuzhao123232 Thank you very much for sharing your code! I see that you have saved the point clouds for every instances, how did you modify the datasets_genpose.py code accordingly?

` def __getitem__(self, index):
    # t0 = time.time()
    # print(f"----------------------------------------------------------")
    img_path = os.path.join(self.data_dir, self.img_list[index])
    if img_path in self.invaild_list:
        return self.__getitem__((index + 1) % self.__len__())
    try:
        with open(img_path + '_label.pkl', 'rb') as f:
            gts = cPickle.load(f)
    except Exception:
        return self.__getitem__((index + 1) % self.__len__())
    if 'CAMERA' in img_path.split('/'):
        out_camK = self.camera_intrinsics
        img_type = 'syn'
    else:
        out_camK = self.real_intrinsics
        img_type = 'real'

    # select one foreground object,
    # if specified, then select the object
    if self.per_obj != '':
        idx = gts['class_ids'].index(self.per_obj_id)
    else:
        idx = random.randint(0, len(gts['instance_ids']) - 1)  # index of the instance within the image
        ''' 
        ############### remove selected categories ###############
        remove_ids = self.cat_name2id['bowl']
        idx = None
        for i in range(10):
            idx_i = random.randint(0, len(gts['instance_ids']) - 1)
            if gts['class_ids'][idx_i] != remove_ids:
                idx = idx_i
                break
        if idx is None:
            return self.__getitem__((index + 1) % self.__len__())
        ##########################################################
        '''
    if gts['class_ids'][idx] == 6 and img_type == 'real':
        if self.mode == 'train':
            handle_tmp_path = img_path.split('/')
            scene_label = handle_tmp_path[-2] + '_res'
            img_id = int(handle_tmp_path[-1])
            mug_handle = self.mug_sym[scene_label][img_id]
        else:
            mug_handle = gts['handle_visibility'][idx]
    else:
        mug_handle = 1
    # t1 = time.time()
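    # print("load gts time:",t1 - t0)
    """load points"""
    # Reuse the idx chosen above: mug_handle was computed for that instance,
    # so re-sampling idx here could pair the point cloud with the wrong handle flag.
    pts_pth = img_path + f'_{idx}_pts.npy'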
    if os.path.exists(pts_pth):
        pcl_in = np.load(pts_pth)
    else:
        return self.__getitem__((index + 1) % self.__len__())
    # t2 = time.time()
    # print("load pts time:",t2 - t1)
    # cat_id, rotation translation and scale
    cat_id = gts['class_ids'][idx] - 1  # convert to 0-indexed
    # note that this is nocs model, normalized along diagonal axis
    model_name = gts['model_list'][idx]
    model = self.models[gts['model_list'][idx]].astype(np.float32)  # 1024 points
    nocs_scale = gts['scales'][idx]  # nocs_scale = image file / model file
    # fsnet scale (from model) scale residual
    fsnet_scale, mean_shape = self.get_fs_net_scale(self.id2cat_name[str(cat_id + 1)], model, nocs_scale)
    fsnet_scale = fsnet_scale / 1000.0
    mean_shape = mean_shape / 1000.0
    rotation = gts['rotations'][idx]
    translation = gts['translations'][idx]
    # sym
    sym_info = self.get_sym_info(self.id2cat_name[str(cat_id + 1)], mug_handle=mug_handle)
    # generate augmentation parameters
    bb_aug, rt_aug_t, rt_aug_R = self.generate_aug_parameters()
    # t3 = time.time()
    # print("process time :",t3 - t2)
    # vis data
    # print("image_pth", img_path)
    # pcd = o3d.geometry.PointCloud()
    # pcd.points = o3d.utility.Vector3dVector(pcl_in)
    # o3d.visualization.draw_geometries([pcd])

    data_dict = {}
    data_dict['pcl_in'] = torch.as_tensor(pcl_in.astype(np.float32)).contiguous()
    data_dict['cat_id'] = torch.as_tensor(cat_id, dtype=torch.int8).contiguous()
    data_dict['rotation'] = torch.as_tensor(rotation, dtype=torch.float32).contiguous()
    data_dict['translation'] = torch.as_tensor(translation, dtype=torch.float32).contiguous()
    data_dict['fsnet_scale'] = torch.as_tensor(fsnet_scale, dtype=torch.float32).contiguous()
    data_dict['sym_info'] = torch.as_tensor(sym_info.astype(np.float32)).contiguous()
    data_dict['mean_shape'] = torch.as_tensor(mean_shape, dtype=torch.float32).contiguous()
    data_dict['aug_bb'] = torch.as_tensor(bb_aug, dtype=torch.float32).contiguous()
    data_dict['aug_rt_t'] = torch.as_tensor(rt_aug_t, dtype=torch.float32).contiguous()
    data_dict['aug_rt_R'] = torch.as_tensor(rt_aug_R, dtype=torch.float32).contiguous()
    data_dict['model_point'] = torch.as_tensor(model, dtype=torch.float32).contiguous()
    data_dict['nocs_scale'] = torch.as_tensor(nocs_scale, dtype=torch.float32).contiguous()
    data_dict['handle_visibility'] = torch.as_tensor(int(mug_handle), dtype=torch.int8).contiguous()
    data_dict['path'] = img_path
    # print(f"total time:",time.time() - t0)
    return data_dict `

Answer 17 · 2024-03-11T02:07:46.000Z

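@fuzhao123232 Hello, I used your code above that saves the point clouds as npy files, but I do not know how to use the result. How do you train with the saved npy files, and how should the file paths be handled?
Also, your code uses for i in range(5000):
should 5000 be changed to 27500?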

Answer 18 · 2024-03-11T02:12:01.000Z

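> @fuzhao123232 Hello, I used your code above that saves the point clouds as npy files, but I do not know how to use the result. How do you train with the saved npy files, and how should the file paths be handled? Also, your code uses for i in range(5000): should 5000 be changed to 27500?

Of course you have to change it to the length of your own data. To use the files, just rewrite the getitem method in datasets_genpose.py; my code is posted above.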

Answer 19 · 2024-03-11T02:22:37.000Z

@fuzhao123232

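> @fuzhao123232 Hello, I used your code above that saves the point clouds as npy files, but I do not know how to use the result. How do you train with the saved npy files, and how should the file paths be handled? Also, your code uses for i in range(5000): should 5000 be changed to 27500?

> Of course you have to change it to the length of your own data. To use the files, just rewrite the getitem method in datasets_genpose.py; my code is posted above.

Thanks for the pointer! After switching to your rewritten __getitem__ method, should I change img_path to the path of the saved npy files?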

Answer 20 · 2024-03-12T02:12:40.000Z

@fuzhao123232

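> @fuzhao123232 Hello, I used your code above that saves the point clouds as npy files, but I do not know how to use the result. How do you train with the saved npy files, and how should the file paths be handled? Also, your code uses for i in range(5000): should 5000 be changed to 27500?

> Of course you have to change it to the length of your own data. To use the files, just rewrite the getitem method in datasets_genpose.py; my code is posted above.

> Thanks for the pointer! After switching to your rewritten __getitem__ method, should I change img_path to the path of the saved npy files?

You need to be clear about what the input is: the point cloud and its corresponding gt. Nothing else is needed; rgb, depth, and mask are no longer required.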
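Reading the two snippets above together, the rewritten `__getitem__` looks for `img_path + '_label.pkl'` and `img_path + f'_{idx}_pts.npy'`, and the preprocessing script writes both files into the same per-scene folder. The sketch below spells out that mapping. The helper `npy_paths` and the example directory are hypothetical; it assumes `data_dir` and the image list are pointed at the folder the script saved into.

```python
import os


def npy_paths(data_dir, img_entry, idx):
    """Return (label path, point-cloud path) as the rewritten __getitem__ builds them."""
    img_path = os.path.join(data_dir, img_entry)
    label_pth = img_path + '_label.pkl'      # copied next to the npy files by the script
    pts_pth = img_path + f'_{idx}_pts.npy'   # instance idx of this image
    return label_pth, pts_pth


# Hypothetical layout: scene 00012, image 0003, instance 2
label_pth, pts_pth = npy_paths('/data/CAMERA/train_npy', '00012/0003', 2)
print(label_pth)
print(pts_pth)
```

Since the rewritten method still picks the intrinsics and image type by checking whether `'CAMERA'` is one of the path components, the new root should keep a `CAMERA` directory in the path for the synthetic data.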

Answer 21 · 2024-03-12T02:29:40.000Z

@fuzhao123232

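> @fuzhao123232 Hello, I used your code above that saves the point clouds as npy files, but I do not know how to use the result. How do you train with the saved npy files, and how should the file paths be handled? Also, your code uses for i in range(5000): should 5000 be changed to 27500?

> Of course you have to change it to the length of your own data. To use the files, just rewrite the getitem method in datasets_genpose.py; my code is posted above.

> Thanks for the pointer! After switching to your rewritten __getitem__ method, should I change img_path to the path of the saved npy files?

> You need to be clear about what the input is: the point cloud and its corresponding gt. Nothing else is needed; rgb, depth, and mask are no longer required.

OK, thank you for the guidance! ^^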

Answer 22 · 2024-05-03T02:42:01.000Z

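> It turned out to be a dual-GPU problem. With a single 4090, one epoch takes a little over 8 minutes (bs=192), but with two GPUs it takes 20 minutes per epoch (bs=384). Which should I choose? Confused!

@fuzhao123232 Hello, I saw that you mentioned training with two GPUs. How did you set up multi-GPU training? When I try to train with three GPUs, neither setting CUDA_VISIBLE_DEVICES=0,1,2 in the terminal nor adding nn.DataParallel in the py file has any effect. I also edited scripts/train_score.sh to use three GPUs. The GPU being used did change, but at most one GPU is ever busy; whichever GPU is used, the other two stay idle.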

Answer 23 · 2024-09-13T09:32:04.000Z

@fuzhao123232 Hello, how long did you train this score_net in the end, and what did the final loss drop to?