A question about Section A.5.1 in your paper (data preparation).
yanjk3 opened this issue · 4 comments
In section A.5.1 DATA PREPARATION AND DIFFUSION MODEL TRAINING:
The authors mentioned, "32 uniformly distributed azimuth angles are used for rendering, starting from the front view."
Does it mean that this view must be frontal if the azimuth angle is 0?
I am curious how to implement that, or how to determine which view is the front view.
Is this constraint manually ensured by humans?
Besides, will the used training data be released?
I sincerely hope these data could be released for research.
Actually, I meant starting from an azimuth of 0 degrees, which is not guaranteed to be a front view. I will correct this in the paper.
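To make the statement concrete, here is a minimal sketch of 32 uniformly spaced azimuth angles starting at 0 degrees. The rendering code is not part of this thread, so the function names, the elevation, the radius, and the z-up axis convention are all assumptions for illustration only.

```python
import math

def uniform_azimuths(num_views=32, start_deg=0.0):
    """Return num_views azimuths in degrees, evenly spaced over 360, starting at start_deg."""
    step = 360.0 / num_views
    return [(start_deg + i * step) % 360.0 for i in range(num_views)]

def camera_position(azimuth_deg, elevation_deg=15.0, radius=2.0):
    """Convert spherical camera coordinates to Cartesian (x, y, z).

    Assumes a z-up convention; elevation and radius defaults are hypothetical.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)

azimuths = uniform_azimuths()
positions = [camera_position(a) for a in azimuths]
print(azimuths[:4])  # [0.0, 11.25, 22.5, 33.75]
```

With 32 views the spacing is 11.25 degrees, so azimuth=90 is the ninth view (index 8). Nothing in this construction ties azimuth=0 to the front of an object; that depends on how each asset is oriented in its own coordinate frame.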
Thanks for the reply :)
However, I noticed that during inference, azimuth=0 always corresponds to the front of the generated object.
For example, with the prompt "A bulldog wearing a black pirate hat", the image generated at azimuth=0 is the front view of the bulldog.
This phenomenon also appears with other text prompts.
I feel that this correspondence is strange because there seems to be no mechanism or design to ensure that the model will associate azimuth=0 with the front view of the object.
How did the model learn this correspondence?
When I read your paper, I guessed that you had already associated azimuth=0 with the front of the object when rendering the Objaverse dataset, but you said that is not the case.
I now wonder whether azimuth=0 is naturally related to the front view.
Since the code for rendering the Objaverse dataset is not provided, could you please check if azimuth=0 corresponds to the front view?
I think this is important for the convergence of the model.
If they do not correspond, then the images generated for a specific camera position are ambiguous, i.e., the model only knows the relative angles of the 4 generated images but cannot determine the absolute viewing angle of a specific image.
Yeah, there is no guarantee that azimuth=0 is the front view. But in the data, we use CLIP features to align different 3D models based on the 32 rendered views, so that the absolute camera position is meaningful during training.
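The reply does not describe the alignment procedure, so the following is only one plausible reading, not the authors' method. It assumes that each object has one feature vector per rendered view, and that alignment means choosing the cyclic shift of the azimuth ordering that best matches a reference set of per-view features. Random vectors stand in for CLIP embeddings, and every function name here is hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_cyclic_shift(views, reference):
    """Return the shift s that maximizes the mean cosine similarity
    between views[(i + s) % n] and reference[i]."""
    n = len(reference)

    def score(s):
        return sum(cosine(views[(i + s) % n], reference[i]) for i in range(n)) / n

    return max(range(n), key=score)

def align(views, reference):
    """Reorder views so that index 0 matches the reference's index 0 as well as possible."""
    s = best_cyclic_shift(views, reference)
    return views[s:] + views[:s], s

# Toy check: stand-in features for 32 views, 16 dimensions each (assumed sizes).
rng = random.Random(0)
reference = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(32)]
rotated = reference[-5:] + reference[:-5]  # same object, ordering offset by 5 views
aligned, shift = align(rotated, reference)
print(shift)  # 5
```

Under this reading, a shared reference would give all objects a common azimuth origin, which is consistent with the observation that a given azimuth usually, but not always, shows the same side of the object.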
Thanks for the reply!
I understand that the camera position makes sense: it helps the model perceive changing viewpoints.
I'm curious why azimuth=90 (the default setting in t2i.py) always generates a front view of the object, e.g., in the samples shown in README.md.
After multiple inference attempts, I found that azimuth=90 will generate a frontal view with a high probability, but not 100%.
Then, I tried to render the Objaverse dataset myself, and I found that the renderings did show a certain correspondence between the view (front, side, or back) and the camera position, but the correspondence was not exact either.
For example, in most cases (not all), the rendering at azimuth=90 is the front view of the object.
I have only recently started working with 3D vision, and I wonder whether this correspondence is some kind of convention.
Besides, I think this correspondence provides a very strong prior to the diffusion model, which greatly reduces the difficulty of training and thus lets it generate highly consistent 3D objects.
And I have a last question here: why are 32 views used?
Have you ever used fewer views, for example 20, 16, or fewer? How do they perform?