DC1991/G2L_Net

A few questions.

Closed this issue · 5 comments

Hello @DC1991
Thank you for your work.

I ran your code and it executed well.
Unfortunately, I don't understand the meaning of the output values (R and T). Would you mind giving me some explanation?

In your paper I found the part below. Is the code for this process included in the current source code?
If it isn't, would you mind giving me a link to the code?

However, both LINEMOD and YCB-Video datasets do not contain the label for each point of the point cloud. To train G2L-Net in a supervised fashion, we adopt an automatic way to label each point of the point cloud of [?]. As described in [?], we label each point in two steps: First, for the 3D model of an object, we transform it into the camera coordinate using the corresponding ground truth. We adopt the implementation provided by [14] for this process.
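
A minimal sketch of that labeling idea (not the authors' code; `label_points`, the array names, and the distance threshold are all illustrative assumptions):

```python
import numpy as np

def label_points(scene_pts, model_pts, R_gt, t_gt, dist_thresh=0.005):
    """Label each scene point as object (1) or background (0).

    scene_pts: (M, 3) depth-derived point cloud in the camera frame
    model_pts: (N, 3) vertices of the object's 3D model
    R_gt, t_gt: ground-truth rotation (3, 3) and translation (3,)
    """
    # Transform the model into the camera coordinate frame with the ground-truth pose.
    model_cam = model_pts @ R_gt.T + t_gt
    # A scene point counts as "object" if it lies near some transformed model point.
    d = np.linalg.norm(scene_pts[:, None, :] - model_cam[None, :, :], axis=-1)
    return (d.min(axis=1) < dist_thresh).astype(np.int64)
```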

Thank you

Hi @gachiemchiep Thanks for your interest in the paper. The output value of R is the coordinates of the 3D bounding box, which is the 24D vector in the paper, and the output value of T is [x, y, z]. The labeling code is not available yet, but we use the implementation in this repository (https://github.com/thodan/bop_toolkit) to transform the 3D object model into the scene.
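
In case it helps with decoding those outputs, here is a minimal sketch of how R and T could be interpreted, assuming the 24D vector is ordered as 8 corner points × 3 coordinates (please verify the exact convention against the code; the function and argument names below are made up):

```python
import numpy as np

def decode_outputs(pred_box_24d, pred_t):
    """Hypothetical decoding of the two outputs described above.

    pred_box_24d: 24D vector = 8 bounding-box corners x 3 coordinates
    pred_t:       [x, y, z] translation of the object
    """
    corners = np.asarray(pred_box_24d, dtype=np.float32).reshape(8, 3)
    t = np.asarray(pred_t, dtype=np.float32)
    # If the corners are predicted relative to the translation, shifting them by t
    # places the 3D bounding box in the camera coordinate frame.
    corners_cam = corners + t
    return corners_cam, t
```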

@DC1991
Thank you for your reply.
I understand the meaning of R. So what is the meaning of the [x, y, z] of T?
I will look into bop_toolkit to find more details about creating the training dataset.

@gachiemchiep Sorry for the unclear description. [x, y, z] is the 3D coordinate of T, which is the translation vector.

@DC1991 Thank you for your explanation.
I'm trying to visualize the detection result.

Do the depth data and RGB use the same coordinate origin? If so, can the [x, y, z] be understood as:

  1. (x, y) = the coordinates of the detected point in image space
  2. z = the depth value.

Sorry, I'm totally lost on understanding T (translation). Is T the translation between the image and the depth?

@gachiemchiep We use RGB to locate the 2D bounding box of the object, and we transform the depth image into a point cloud with the known camera parameters. So [x, y, z] are the 3D coordinates of the points.
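
For visualization, the depth-to-point-cloud step mentioned here is the standard pinhole back-projection. A minimal sketch, assuming a depth image in metres and camera intrinsics fx, fy, cx, cy (names are illustrative, not taken from the repo):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth image in metres into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grids, each of shape (H, W)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # keep only pixels with valid (non-zero) depth
```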