j96w/DenseFusion

Own dataset training results are not accurate

fbas-est opened this issue · 39 comments

Hi,

I am trying to train the network on my own dataset, but the results are not good enough even though the model converges.
My dataset has 3000 annotated images in total.
My camera is a realsense depth camera D415 with the following parameters:
"fx": 607.3137817382812
"fy": 606.8499145507812
"ppx": 330.49334716796875
"ppy": 239.25704956054688
"height": 480
"width": 640
"depth_scale": 0.0010000000474974513
I created my own dataset.py based on the Linemod dataset.py, but I changed the following lines:

cam_scale = 1.0
pt2 = depth_masked / cam_scale
pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
cloud = np.concatenate((pt0, pt1, pt2), axis=1)
cloud = cloud / 1000.0

to:

cam_scale = self.cam_scale # 0.0010000000474974513
pt2 = depth_masked * cam_scale
pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
pt1 = (xmap_masked - self.cam_cy) * pt2/ self.cam_fy
cloud = np.concatenate((pt0, pt1, pt2), axis=1)
cloud = cloud

I also removed every division by 1000 in the code because my mesh values are already in meters.
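
For reference, the back-projection in the modified lines can be written as a small standalone function. This is only a sketch under stated assumptions: the function name and the sample depth readings are made up, and it assumes (as the snippets above imply by pairing ymap with cam_cx and xmap with cam_cy) that ymap holds column indices and xmap holds row indices.

```python
import numpy as np

def backproject(depth_raw, rows, cols, fx, fy, cx, cy, depth_scale):
    """Back-project pixels with raw depth readings into camera-frame points in meters.

    depth_raw, rows, cols: 1-D arrays of equal length (raw depth, pixel row, pixel column).
    depth_scale: meters per raw depth unit (0.001 means the depth image is in millimeters).
    """
    z = depth_raw.astype(np.float32) * depth_scale    # depth in meters
    x = (cols.astype(np.float32) - cx) * z / fx       # column index pairs with cx and fx
    y = (rows.astype(np.float32) - cy) * z / fy       # row index pairs with cy and fy
    return np.stack((x, y, z), axis=1)

# Intrinsics quoted in the post above (RealSense D415, 640x480).
fx, fy = 607.3137817382812, 606.8499145507812
cx, cy = 330.49334716796875, 239.25704956054688

depth_raw = np.array([1000, 500], dtype=np.uint16)    # made-up readings: 1.0 m and 0.5 m
rows = np.array([240, 0])
cols = np.array([330, 639])
cloud = backproject(depth_raw, rows, cols, fx, fy, cx, cy, depth_scale=0.001)
print(cloud)
```

With a cloud in meters, the mesh, the ground-truth translation and the object diameter all have to be in meters as well.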

The object’s diam is: 0.324
The estimator’s loss is: 0.0146578 and
the refiner’s loss is : 0.01338558

Any idea what is wrong with my implementation?
Thanks.

@fbas-est
Hello. This is unrelated to your question, but I am also trying to use DenseFusion on my own dataset.
May I ask what your environment settings are (CUDA version, etc.), and the steps for how you successfully managed to build using your own dataset?
Thank you in advance.

Hello, I'm also building my own dataset for training and using a RealSense camera to estimate object poses. I've run into some problems as well. Would you be willing to share contact information so we can talk? My WeChat is 18845107925

@jc0725
Hello, I use CUDA 10.1 and PyTorch 1.6.
To build my dataset I used ObjectDatasetTools. You can find the source code on GitHub: https://github.com/F2Wang/ObjectDatasetTools
To make it work, I changed the format of the dataset to match the format of DenseFusion's Linemod dataset.

@fbas-est
Thank you for your response.
May I ask how you trained the SegNet for LINEMOD? Did you change the "--dataset_root" directory to LINEMOD instead of YCB in ./vanilla_segmentation/train.py ?

Also, after training, what script did you run to get the 6DoF results?

I apologize if my questions are quite elementary.

@jc0725
Yes. I also changed dataset.py a bit to make it work for my dataset.
To get the 6DoF results I ran a slightly different version of eval_linemod.py, with some added functions for visualizing the 3D bounding box.

@fbas-est
Would it be possible for you to upload your working code to your repository so that I can clone it?

Thank you very much for your reply. I also used ObjectDatasetTools to make my own dataset. I took 10000 pictures of a single object, but after training for 20 epochs the predicted pose was far off when I ran the model to estimate the object's pose. I wanted to ask how many epochs you trained for, and how you got the green bounding box in your video. Thank you. @fbas-est

3d844a26c702f624fea6619a37124476.mp4

Here is the code for visualizing:
visualize.txt

@Xushuangyin
Did you produce the 10000 pictures from one video or from different videos? In my case I used different videos due to RAM limitations.
The problem was that every video produces point clouds with different rotation and translation matrices, so the model could not use the same mesh for the whole combined dataset.

I made the 10000 pictures from different videos. If there are too many pictures, the program reports an error. I made my own object mesh. How can I solve the problem you described? @fbas-est

Thank you very much for your code! @fbas-est

@fbas-est
Thank you very much. I will let you know if I am able to make any improvements or if I come up with any suggestions for improved accuracy on your project.

@Xushuangyin
I suggest you begin by finding a way to render the point cloud onto the labeled dataset's color images (a 3D bounding box won't be enough). If the target point cloud (the point cloud used as the label) is not accurate, the network won't work.
If that's the problem, then for every video collected you need to change the transforms in the file transforms.npy so that they all use one mesh as the reference, and then label the videos with that mesh.
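
The label check suggested above can be sketched as a plain pinhole projection of the model points into pixel coordinates. This is a minimal sketch, not the code from visualize.txt: the function name, intrinsics and pose below are made up, and the pose is assumed to be model-to-camera in the same length unit as the mesh.

```python
import numpy as np

def project_model(model_points, R, t, fx, fy, cx, cy):
    """Project model points (N, 3) into pixel coordinates using the pose (R, t).

    R: (3, 3) rotation, t: (3,) translation, both model-to-camera and in the
    same length unit as model_points. Returns (N, 2) rows of (column, row).
    """
    cam = model_points @ R.T + t            # points in the camera frame
    u = fx * cam[:, 0] / cam[:, 2] + cx     # pixel column
    v = fy * cam[:, 1] / cam[:, 2] + cy     # pixel row
    return np.stack((u, v), axis=1)

# Made-up example: two model points, object placed 1 m in front of the camera.
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])
uv = project_model(pts, R, t, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(uv)
```

Drawing these pixels over the color image of each labeled frame shows quickly whether the mesh, the pose labels and the intrinsics agree.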

Do you resize images during inference?
I get odd convolution errors:

RuntimeError: Calculated padded input size per channel: (6 x 320). Kernel size: (7 x 7). Kernel size can't be greater than actual input size

RuntimeError: Calculated padded input size per channel: (6 x 287). Kernel size: (7 x 7). Kernel size can't be greater than actual input size

The size is different each time, so I guess the cause is the image or mask size? Where should I resize?

@Xushuangyin @fbas-est
thank you

Hi @Xushuangyin, thank you for responding. I actually found the source: I was transposing the array incorrectly.
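
A kernel-size error with one very small spatial dimension is consistent with the channel axis ending up where the network expects height or width. A minimal sketch of the layout conversion with a shape check follows; the helper name is hypothetical and the image size is made up.

```python
import numpy as np

def to_chw(img_hwc):
    """Convert an (H, W, C) image array to the (C, H, W) layout used as network input."""
    if img_hwc.ndim != 3 or img_hwc.shape[2] not in (3, 4):
        raise ValueError("expected an (H, W, 3) or (H, W, 4) array, got %s" % (img_hwc.shape,))
    return np.transpose(img_hwc[:, :, :3], (2, 0, 1))   # drop alpha, then channels first

img = np.zeros((480, 640, 3), dtype=np.uint8)
chw = to_chw(img)
crop = chw[:, 100:220, 200:280]    # crop rows and columns, keep all channels
print(chw.shape, crop.shape)
```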

Right now, @Xushuangyin, I am having issues with NaN values in training after removing the / 1000, since my depth and other quantities are in meters.

I also reduced the learning rate, but I still get NaN.
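
Before changing training hyperparameters, it can help to check the samples themselves. The sketch below is an assumption about what is worth checking, not part of DenseFusion: it flags non-finite values and compares the spatial extent of the observed cloud with that of the target points, since a ratio near 1000 hints at a millimeter/meter mix-up. The function name, thresholds and data are made up.

```python
import numpy as np

def check_sample(cloud, target, name="sample"):
    """Return a list of human-readable problems found in one training sample."""
    problems = []
    for label, arr in (("cloud", cloud), ("target", target)):
        if not np.all(np.isfinite(arr)):
            problems.append("%s: %s contains NaN or inf" % (name, label))
    # Compare the diagonal of the axis-aligned bounding box of both point sets.
    extent_cloud = np.linalg.norm(cloud.max(axis=0) - cloud.min(axis=0))
    extent_target = np.linalg.norm(target.max(axis=0) - target.min(axis=0))
    if extent_target > 0 and extent_cloud > 0:
        ratio = extent_cloud / extent_target
        if ratio > 100 or ratio < 0.01:
            problems.append("%s: cloud/target extent ratio is %.1f" % (name, ratio))
    return problems

rng = np.random.default_rng(0)
target_m = rng.uniform(-0.15, 0.15, size=(500, 3))    # made-up object, about 0.3 m across
cloud_mm = target_m * 1000.0                           # same points, wrongly left in mm
print(check_sample(target_m, target_m))    # consistent units
print(check_sample(cloud_mm, target_m))    # flags the extent ratio
```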

@Xushuangyin So now I just get giant results. I confirmed that my meshes are in meters, so I removed the / 1000.

image

Full code here


import torch.utils.data as data
from PIL import Image
import os
import os.path
import errno
import torch
import json
import codecs
import numpy as np
import sys
import torchvision.transforms as transforms
import argparse
import time
import random
import numpy.ma as ma
import copy
import scipy.misc
import scipy.io as scio
import yaml
import cv2


class PoseDataset(data.Dataset):
    def __init__(self, mode, num, add_noise, root, noise_trans, refine):
        self.objlist = [0, 1]
        self.mode = mode

        self.list_rgb = []
        self.list_depth = []
        self.list_label = []
        self.list_obj = []
        self.list_rank = []
        self.meta = {}
        self.pt = {}
        self.root = root
        self.noise_trans = noise_trans
        self.refine = refine
        min = 1000


        item_count = 0
        for item in self.objlist:
            if self.mode == 'train':
                input_file = open('{0}/data/{1}/train.txt'.format(self.root, '%d' % item))
            else:
                input_file = open('{0}/data/{1}/test.txt'.format(self.root, '%d' % item))
            while 1:
                item_count += 1
                input_line = input_file.readline()
                if self.mode == 'test' and item_count % 10 != 0:
                    continue
                if not input_line:
                    break
                if input_line[-1:] == '\n':
                    input_line = input_line[:-1]
                self.list_rgb.append('{0}/data/{1}/rgb/{2}.jpg'.format(self.root, '%d' % item, input_line))
                self.list_depth.append('{0}/data/{1}/depth/{2}.png'.format(self.root, '%d' % item, input_line))
                if self.mode == 'eval':
                    self.list_label.append('{0}/segnet_results/{1}_label/{2}_label.png'.format(self.root, '%d' % item, input_line))
                else:
                    self.list_label.append('{0}/data/{1}/mask/{2}.png'.format(self.root, '%d' % item, input_line))
                
                self.list_obj.append(item)
                self.list_rank.append(int(input_line))

            meta_file = open('{0}/data/{1}/gt.yml'.format(self.root, '%d' % item), 'r')
            self.meta[item] = yaml.safe_load(meta_file)
            self.pt[item] = npy_vtx('{0}/models/{1}.npy'.format(self.root, '%d' % item))

            if len(self.pt[item]) < min:
                min = len(self.pt[item])
            
            print("Object {0} buffer loaded".format(item))

        self.length = len(self.list_rgb)
        self.num_pt_mesh_small = min
        
        # retrieved from /usr/local/zed/settings according to 
        # https://support.stereolabs.com/hc/en-us/articles/360007497173-What-is-the-calibration-file-
        self.cam_cx = 1080.47
        self.cam_cy = 613.322
        self.cam_fx = 1057.8
        self.cam_fy = 1056.61


        self.num = num
        self.add_noise = add_noise
        self.trancolor = transforms.ColorJitter(0.2, 0.2, 0.2, 0.05)
        self.norm = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.border_list = [-1, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680]
        self.num_pt_mesh_large = 500
        # self.num_pt_mesh_small = 100
        self.symmetry_obj_idx = []

    def __getitem__(self, index):
        img = Image.open(self.list_rgb[index])
        ori_img = np.array(img)
        depth = np.array(Image.open(self.list_depth[index]))
        label = np.array(Image.open(self.list_label[index]))


        self.height, self.width, _ = np.shape(img)

        # xmap holds the row index and ymap the column index of every pixel
        self.xmap = np.array([[j for i in range(self.width)] for j in range(self.height)])
        self.ymap = np.array([[i for i in range(self.width)] for j in range(self.height)])

        # # removing alpha channel
        if np.shape(label)[-1] == 4 :
            label = label[:,:,:-1] 

        obj = self.list_obj[index]
        rank = self.list_rank[index]        

        if obj == 2:
            for i in range(0, len(self.meta[obj][rank])):
                if self.meta[obj][rank][i]['obj_id'] == 2:
                    meta = self.meta[obj][rank][i]
                    break
        else:
            meta = self.meta[obj][rank][0]
        #return array of bools
        mask_depth = ma.getmaskarray(ma.masked_not_equal(depth, 0))
        if self.mode == 'eval':
            mask_label = ma.getmaskarray(ma.masked_equal(label, np.array(255)))
        else:
            mask_label = ma.getmaskarray(ma.masked_equal(label, np.array([255, 255, 255])))[:, :, 0]
        
        mask = mask_label * mask_depth

        if self.add_noise:
            img = self.trancolor(img)

        # remove alpha channel
        img = np.array(img)[:, :, :3]
        img = np.transpose(img, (2, 0, 1))
        img_masked = img

        if self.mode == 'eval':
            rmin, rmax, cmin, cmax = get_bbox(mask_to_bbox(mask_label))
        else:  # obj_bb: [minX, minY, widthOfBbx, heightOfBbx]
            rmin, rmax, cmin, cmax = get_bbox(meta['obj_bb'])

        img_masked = img_masked[:, rmin:rmax, cmin:cmax]
        # p_img = np.transpose(img_masked, (1, 2, 0))
        # cv2.imwrite('{0}_input.png'.format(index), p_img)

        choose = mask[rmin:rmax, cmin:cmax].flatten().nonzero()[0]
        if len(choose) == 0:
            cc = torch.LongTensor([0])
            return(cc, cc, cc, cc, cc, cc)

        if len(choose) > self.num:
            c_mask = np.zeros(len(choose), dtype=int)
            c_mask[:self.num] = 1
            np.random.shuffle(c_mask)
            choose = choose[c_mask.nonzero()]
        else:
            choose = np.pad(choose, (0, self.num - len(choose)), 'wrap')
        
        depth_masked = depth[rmin:rmax, cmin:cmax].flatten()[choose][:, np.newaxis].astype(np.float32)
        xmap_masked = self.xmap[rmin:rmax, cmin:cmax].flatten()[choose][:, np.newaxis].astype(np.float32)
        ymap_masked = self.ymap[rmin:rmax, cmin:cmax].flatten()[choose][:, np.newaxis].astype(np.float32)
        choose = np.array([choose])

        cam_scale = 1.0
        pt2 = depth_masked / cam_scale
        pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
        pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
        cloud = np.concatenate((pt0, pt1, pt2), axis=1)
        # cloud = cloud / 1000.0
        cloud = cloud 

        #fw = open('evaluation_result/{0}_cld.xyz'.format(index), 'w')
        #for it in cloud:
        #    fw.write('{0} {1} {2}\n'.format(it[0], it[1], it[2]))
        #fw.close()

        # model_points = self.pt[obj] / 1000.0
        model_points = self.pt[obj]
        dellist = [j for j in range(0, len(model_points))]
        dellist = random.sample(dellist, len(model_points) - self.num_pt_mesh_small)
        model_points = np.delete(model_points, dellist, axis=0)

        target_r = np.resize(np.array(meta['cam_R_m2c']), (3, 3))
        target_t = np.array(meta['cam_t_m2c'])
        add_t = np.array([random.uniform(-self.noise_trans, self.noise_trans) for i in range(3)])

        if self.add_noise:
            cloud = np.add(cloud, add_t)

        #fw = open('evaluation_result/{0}_model_points.xyz'.format(index), 'w')
        #for it in model_points:
        #    fw.write('{0} {1} {2}\n'.format(it[0], it[1], it[2]))
        #fw.close()

        target = np.dot(model_points, target_r.T)
        # if self.add_noise:
        #     target = np.add(target, target_t / 1000.0 + add_t)
        #     out_t = target_t / 1000.0 + add_t
        # else:
        #     target = np.add(target, target_t / 1000.0)
        #     out_t = target_t / 1000.0


        if self.add_noise:
            target = np.add(target, target_t + add_t)
            out_t = target_t + add_t
        else:
            target = np.add(target, target_t)
            out_t = target_t 
        #fw = open('evaluation_result/{0}_tar.xyz'.format(index), 'w')
        #for it in target:
        #    fw.write('{0} {1} {2}\n'.format(it[0], it[1], it[2]))
        #fw.close()

        # np.shape(cloud) (500, 3)
        # np.shape(choose) (1, 500)
        # np.shape(img_masked) (3, 120, 80)
        # np.shape(target) (24, 3)
        # np.shape(model_points) (24, 3)
  
        return torch.from_numpy(cloud.astype(np.float32)), \
               torch.LongTensor(choose.astype(np.int32)), \
               self.norm(torch.from_numpy(img_masked.astype(np.float32))), \
               torch.from_numpy(target.astype(np.float32)), \
               torch.from_numpy(model_points.astype(np.float32)), \
               torch.LongTensor([self.objlist.index(obj)])

    def __len__(self):
        return self.length

    def get_sym_list(self):
        return self.symmetry_obj_idx

    def get_num_points_mesh(self):
        if self.refine:
            return self.num_pt_mesh_large
        else:
            return self.num_pt_mesh_small

border_list = [-1, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680]

def mask_to_bbox(mask):
    mask = mask.astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)


    x = 0
    y = 0
    w = 0
    h = 0
    for contour in contours:
        tmp_x, tmp_y, tmp_w, tmp_h = cv2.boundingRect(contour)
        if tmp_w * tmp_h > w * h:
            x = tmp_x
            y = tmp_y
            w = tmp_w
            h = tmp_h
    return [x, y, w, h]


def get_bbox(bbox):
    bbx = [bbox[1], bbox[1] + bbox[3], bbox[0], bbox[0] + bbox[2]]
    if bbx[0] < 0:
        bbx[0] = 0
    if bbx[1] >= 540:
        bbx[1] = 539
    if bbx[2] < 0:
        bbx[2] = 0
    if bbx[3] >= 960:
        bbx[3] = 959                
    rmin, rmax, cmin, cmax = bbx[0], bbx[1], bbx[2], bbx[3]
    r_b = rmax - rmin
    for tt in range(len(border_list)):
        if r_b > border_list[tt] and r_b < border_list[tt + 1]:
            r_b = border_list[tt + 1]
            break
    c_b = cmax - cmin
    for tt in range(len(border_list)):
        if c_b > border_list[tt] and c_b < border_list[tt + 1]:
            c_b = border_list[tt + 1]
            break
    center = [int((rmin + rmax) / 2), int((cmin + cmax) / 2)]
    rmin = center[0] - int(r_b / 2)
    rmax = center[0] + int(r_b / 2)
    cmin = center[1] - int(c_b / 2)
    cmax = center[1] + int(c_b / 2)
    if rmin < 0:
        delt = -rmin
        rmin = 0
        rmax += delt
    if cmin < 0:
        delt = -cmin
        cmin = 0
        cmax += delt
    if rmax > 540:
        delt = rmax - 540
        rmax = 540
        rmin -= delt
    if cmax > 960:
        delt = cmax - 960
        cmax = 960
        cmin -= delt
    return rmin, rmax, cmin, cmax


def ply_vtx(path):
    f = open(path)
    assert f.readline().strip() == "ply"
    f.readline()
    f.readline()
    N = int(f.readline().split()[-1])
    while f.readline().strip() != "end_header":
        continue
    pts = []
    for _ in range(N):
        pts.append(np.float32(f.readline().split()[:3]))
    return np.array(pts)

def npy_vtx(path):
    return np.load(path,allow_pickle=True)
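
The two pixel-index maps built in __getitem__ can also be computed without Python loops. A minimal sketch, assuming the convention implied by the back-projection lines, where ymap is paired with cam_cx (so it must hold column indices) and xmap with cam_cy (row indices); the helper name is hypothetical.

```python
import numpy as np

def pixel_maps(height, width):
    """Return (xmap, ymap) where xmap[r, c] == r (row) and ymap[r, c] == c (column)."""
    xmap, ymap = np.indices((height, width))
    return xmap, ymap

xmap, ymap = pixel_maps(540, 960)
print(xmap.shape, xmap[5, 7], ymap[5, 7])
```

If both maps contain the row index, the x coordinate of every back-projected point is computed from the wrong pixel coordinate, which distorts the cloud regardless of the depth units.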

Thank you for your help @Xushuangyin

Hey @fbas-est , I'm having issues with my training as well. Did you notice anything weird in your avg distance when you removed /1000 ? Did you remove it anywhere else than dataset.py ?

Thank you @Xushuangyin and @an99990, I solved it by fixing the array. Now I have issues with training and am getting NaNs too, because my data is in meters.
Thanks for any help

cam_scale = 0.001
pt2 = depth_masked * cam_scale
You should change these two lines of code like this.

Because my cam_scale is 0.001, I modified the code like this:
@an99990 @orangeRobot990
image

Thank you so much, @Xushuangyin. I was finally able to get results using cam_scale = 0.001 and without dividing by 1000 in __getitem__. I will start another training run with the correct values. Thank you so much!

Hello. May I ask how any of you were able to train your custom dataset on SegNet?
It seems like the provided code is for YCB format and not Linemod format.

My guess was that I would have to run the SegNet train.py for each of the individual objects for Linemod.

Thank you for your response.
Do you mean that you didn't train SegNet?

I trained SegNet on 300 pictures of a single object. @jc0725

@Xushuangyin
Thank you for clarifying!
Also, were you able to successfully visualize the bounding box using the visualize.py code provided by @fbas-est ?

@an99990
Hello, I saw that you are using a ZED camera, and from the intrinsics I assume you didn't train the model on 480p images.
Did you successfully train the model at a higher resolution?

@fbas-est I generated the images in Unity. They are 560 x 940, if I remember correctly. My poses do not seem to be quite correct, though. Here's an image from inference. I might create a dataset with images from the ZED camera. The camera in Unity didn't have the same intrinsics as the ZED, so that might be why my results aren't precise. I also never reached the refinement step during training.

image

@an99990 Yes, that is probably the issue. The ZED camera comes with four built-in calibrations, the smallest being for 672x376 images. If you train the network on synthetic data, I guess you have to replicate the images that your camera captures.
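
When the training images are a plain resize of the calibrated resolution, the pinhole intrinsics scale with the image size. A minimal sketch with made-up numbers; the function name is hypothetical, and it assumes no cropping and ignores half-pixel offset conventions.

```python
def scale_intrinsics(fx, fy, cx, cy, src_size, dst_size):
    """Rescale pinhole intrinsics from src_size to dst_size, both given as (width, height).

    Valid only for a plain resize of the full image (no cropping).
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return fx * sx, fy * sy, cx * sx, cy * sy

# Made-up numbers for illustration: halve a 1920x1080 calibration to 960x540.
fx, fy, cx, cy = scale_intrinsics(1000.0, 1000.0, 960.0, 540.0, (1920, 1080), (960, 540))
print(fx, fy, cx, cy)
```

A principal point far outside the image bounds used by the loader is a sign that the intrinsics and the image resolution do not belong together.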

May I ask how you created the synthetic dataset ?

I have a Unity project that creates datasets in the Linemod format. I can't share it though, since it is the company's property :/

May I ask how any of you were able to output and save the vanilla_segmentation label png files?
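
The dataset loader posted earlier in this thread reads eval-mode labels from segnet_results/..._label.png and treats pixels equal to 255 as the object. A minimal sketch of producing such a mask from per-pixel class scores follows. The function name, the (C, H, W) score layout and the class index are assumptions, not the vanilla_segmentation code.

```python
import numpy as np

def scores_to_label(scores, object_class=1):
    """Turn per-pixel class scores (C, H, W) into a uint8 mask: 255 on the object, 0 elsewhere."""
    pred = np.argmax(scores, axis=0)                  # (H, W) class index per pixel
    return np.where(pred == object_class, 255, 0).astype(np.uint8)

scores = np.zeros((2, 4, 6), dtype=np.float32)        # made-up 2-class output
scores[0] = 0.5                                       # background score everywhere
scores[1, 1:3, 2:5] = 0.9                             # object wins in a 2x3 patch
label = scores_to_label(scores)
print(label)
# The mask could then be written to disk with an image library,
# for example PIL: Image.fromarray(label).save(path)
```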

@an99990 Hello. I made a Linemod-style dataset with ObjectDatasetTools. In eval_linemod.py its success rate is 0.9285, but when I visualize the result, the points seem to be in the wrong place. Can you give me some advice? Thank you in advance!
Screenshot 2022-05-17 21-55-54
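
As far as I understand eval_linemod.py, the success rate is an ADD-style metric: the mean distance between the model points under the predicted pose and under the labeled pose, counted as a success when it is below 10% of the object diameter. The sketch below illustrates that idea for a non-symmetric object; the function names and numbers are made up, and it is not the evaluation script itself.

```python
import numpy as np

def add_distance(model_points, R_gt, t_gt, R_pred, t_pred):
    """Average distance between model points under the ground-truth and predicted poses."""
    gt = model_points @ R_gt.T + t_gt
    pred = model_points @ R_pred.T + t_pred
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def is_success(dist, diameter, fraction=0.1):
    """ADD-style criterion: success when the distance is below a fraction of the diameter."""
    return dist < fraction * diameter

# Made-up example: prediction offset by 1 cm along z, diameter 0.324 m as in the first post.
pts = np.random.default_rng(0).uniform(-0.1, 0.1, size=(100, 3))
R = np.eye(3)
d = add_distance(pts, R, np.array([0.0, 0.0, 1.0]), R, np.array([0.0, 0.0, 1.01]))
print(d, is_success(d, diameter=0.324))
```

Because the metric compares predictions with the labels only, a high success rate next to a visibly wrong overlay can also point to the labels or to the visualization (units, intrinsics) rather than to the network.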

Have you played with cam_scale? I had to change it to 1000. Try different values; it seems that it's bigger than your object.

@an99990 Thanks for your reply. I made the dataset with a RealSense camera. I changed cam_scale to the camera's own value, like this:
cam_scale = 0.0002500000118743628
pt2 = depth_masked * cam_scale
pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
cloud = np.concatenate((pt0, pt1, pt2), axis=1)
# cloud = cloud / 1000.0
# print(cloud.max())
cloud = cloud

0.0002500000118743628 is the depth scale of the real camera.
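
A quick way to sanity-check a depth scale is to convert a few raw readings and compare them with the real distance to the object. A minimal sketch with made-up raw values; the helper name is hypothetical.

```python
import numpy as np

def raw_depth_to_meters(depth_raw, depth_scale):
    """Convert raw integer depth readings to meters; zeros (no reading) stay zero."""
    return depth_raw.astype(np.float32) * np.float32(depth_scale)

raw = np.array([0, 4000, 2000], dtype=np.uint16)          # made-up readings
print(raw_depth_to_meters(raw, 0.0002500000118743628))    # scale quoted above
print(raw_depth_to_meters(raw, 0.001))                    # millimeter scale, for comparison
```

The same raw value maps to distances that differ by a factor of four between the two scales, so the scale has to match the camera and depth format that actually recorded the images.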

Hi @Xushuangyin and @an99990. I hope you are doing well. I am trying to train this model on my custom dataset. Can you please share if you were able to successfully train the model? Can you share the results if possible? Thanks.

@jc0725 Hi, I also built my own Linemod-style dataset. When I run the code in the debugger, the line input_file = open('{0}/data/{1}/train.txt'.format(self.root, '%02d' % item)) fails with "No such file or directory: '/datasets/linemod/linemod_preprocessed/data/01/train.txt'", so I cannot step through the rest of the code.
However, training through the command bash ./experiments/scripts/train_linemod.sh works without this error. Have you run into this situation? Is there a solution?
Thank you very much for your reply.
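
One possible explanation, and this is only an assumption, is that the dataset root is a relative path, which resolves differently when the debugger starts from another working directory than the shell script does. A second thing to check is the folder naming: '%02d' looks for data/01 while '%d' looks for data/1. The sketch below uses a hypothetical helper and a temporary directory to show both checks.

```python
import os
import tempfile

def split_file(root, obj_id, mode="train", fmt="%02d"):
    """Build the absolute path of an object's train/test list and fail early with a clear message."""
    root = os.path.abspath(root)    # no longer depends on the current working directory
    path = os.path.join(root, "data", fmt % obj_id, "%s.txt" % mode)
    if not os.path.isfile(path):
        raise FileNotFoundError("%s not found (cwd is %s)" % (path, os.getcwd()))
    return path

# Self-contained demo in a temporary directory with a made-up layout.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data", "01"))
    open(os.path.join(tmp, "data", "01", "train.txt"), "w").close()
    print(split_file(tmp, 1))             # found: the folder is named "01"
    try:
        split_file(tmp, 1, fmt="%d")      # looks for data/1/train.txt instead
    except FileNotFoundError as err:
        print("missing:", err)
```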

@fbas-est Hi, I also built my own Linemod-style dataset. When I run the code in the debugger, the line input_file = open('{0}/data/{1}/train.txt'.format(self.root, '%02d' % item)) fails with "No such file or directory: '/datasets/linemod/linemod_preprocessed/data/01/train.txt'", so I cannot step through the rest of the code.
However, training through the command bash ./experiments/scripts/train_linemod.sh works without this error. Have you run into this situation? Is there a solution?
Thank you very much for your reply.