[Question] Training a camera position ControlNet?
arthurwolf opened this issue · 7 comments
Hello!
Thanks for the amazing project.
I'm often in the situation where I've generated a scene I really like, but I'd like to rotate the camera a bit more to the right, or zoom in, or put the camera a bit higher up, etc.
Currently, the only way I've found to do this would be to generate a 3D model of the scene (possibly automatically from a ControlNet-generated depth map?), rotate that, generate a new depth map, and use that to regenerate the image.
But:
- This is cumbersome/slow, and
- This would only let me move the camera small amounts at a time
Another option somebody suggested was training LoRAs on specific angles and having many LoRAs for many different angles/camera positions. But again, that's pretty cumbersome (and a lot of training), and I'm not even sure it would work.
Or train a single LoRA, but with a dataset that matches many different angle "keywords" to many differently positioned images? As you can see, I'm a bit lost.
I figured what I really want to do is manipulate the part of the model's "internal conception" of the scene that defines its rotation (if there is such a thing...). There has to be some set of weights that determines whether we look at a subject from the front or the back, whether a face is seen from the side or at three quarters, etc.
So my question is: would it be possible to create a ControlNet that does this?
The main problem I see is that ControlNet training, as described in
https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md
takes images as input.
But the input in my case wouldn't be an image, it would be an angle.
So my best guess at how to do this (and it's likely completely wrong) would be:
- Take a 3D scene.
- Render it at a specific angle/zoom/camera position.
- Take that rendered image and pair it with a text description of the camera position: angle212 height1.54 etc. Or maybe (angle 0.3) (height 0.25), i.e. play with the strength of the tokens? Something like that.
- Add each pair of rendered image and corresponding position text to the dataset (completely ignoring the "black and white" input image)
- Generate thousands, train.
An example caption might look like: grocery store, shop, vector graphic, rotation-0.5, height-0.5, distance-0.3, sunrotation-0.2, sunheight-0.5, sundistance-1.0
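To make that concrete, here's a rough sketch of what the render loop could look like, using Blender's `bpy` just as an example. It assumes the .blend file already contains the scene, a camera, and a sun lamp named "Sun"; the paths, ranges, and object names are all made up:

```python
import math
import os
import random

import bpy
from mathutils import Vector

scene = bpy.context.scene
cam = scene.camera                      # assumes the scene already has a camera
sun = bpy.data.objects.get("Sun")       # assumes a sun lamp named "Sun"

def orbit(obj, rotation, height, distance):
    # Place the object on a circle around the origin and aim it at the origin.
    angle = rotation * 2.0 * math.pi
    obj.location = Vector((distance * math.cos(angle),
                           distance * math.sin(angle),
                           height))
    direction = Vector((0.0, 0.0, 0.0)) - obj.location
    obj.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()

os.makedirs(bpy.path.abspath("//renders"), exist_ok=True)
captions = []

for i in range(10_000):
    rot, h, d = random.random(), random.random(), random.random()
    sun_rot, sun_h, sun_d = random.random(), random.random(), random.random()

    orbit(cam, rot, h * 5.0, 3.0 + d * 10.0)            # map [0,1] to scene units
    orbit(sun, sun_rot, sun_h * 20.0, 20.0 + sun_d * 30.0)

    scene.render.filepath = f"//renders/{i:05d}.png"
    bpy.ops.render.render(write_still=True)

    captions.append(
        f"grocery store, shop, vector graphic, "
        f"rotation-{rot:.1f}, height-{h:.1f}, distance-{d:.1f}, "
        f"sunrotation-{sun_rot:.1f}, sunheight-{sun_h:.1f}, sundistance-{sun_d:.1f}"
    )

with open(bpy.path.abspath("//captions.txt"), "w") as f:
    f.write("\n".join(captions))
```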
Would this work? Does it have any chance to work? If not, is there any way to do this that would work?
Would a single 3D scene (or even a dumb cube on a plane) work, or do I need a large variety of scenes?
I would love some kind of input/feedback/advice on this.
Thanks so much to anyone who takes the time to reply.
Cheers.
@arthurwolf This is most likely not possible with the ControlNet implementation in this repo, since it relies on converting the guidance into an image that the UNet can understand, and it doesn't really make sense to convert camera position data such as rotation and position into an image format like you mentioned. It would make more sense to generate an image and then control the camera position with a NeRF (neural radiance field) or an image-to-3D model like https://huggingface.co/stabilityai/TripoSR. Directly using the diffusion model to control camera angles would probably require some kind of text conditioning on camera position data to guide the model.
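Purely as an illustration of what that conditioning could look like (the `CameraEmbedder` module below is hypothetical, not part of this repo or any library): embed the camera parameters with a small MLP and append the result as extra tokens to the text-encoder output that the UNet cross-attends to.

```python
import torch
import torch.nn as nn

class CameraEmbedder(nn.Module):
    """Hypothetical module: maps (rotation, height, distance) to extra
    conditioning tokens appended to the text embeddings."""
    def __init__(self, num_params: int = 3, embed_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.embed_dim = embed_dim
        self.mlp = nn.Sequential(
            nn.Linear(num_params, 256),
            nn.SiLU(),
            nn.Linear(256, num_tokens * embed_dim),
        )

    def forward(self, camera_params: torch.Tensor) -> torch.Tensor:
        # camera_params: (batch, num_params), values normalized to [0, 1]
        tokens = self.mlp(camera_params)
        return tokens.view(-1, self.num_tokens, self.embed_dim)

# Usage sketch: concatenate the camera tokens with the text embeddings
# before they reach the UNet's cross-attention layers.
# text_embeds: (batch, 77, 768) from the text encoder
# camera_params: (batch, 3), e.g. [rotation, height, distance]
# cond = torch.cat([text_embeds, CameraEmbedder()(camera_params)], dim=1)
```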
That's similar to creating a dataset for a NeRF model, which is very challenging since it requires a massive amount of high-quality 3D renders (you don't want the diffusion model's image quality to degrade from conditioning on low-quality renders) with a variety of novel views for each. I think it would be easier to add the positions as a caption, produced by a VLM or a camera pose estimator (OpenCV and such), on an existing text-to-image dataset (LAION-2B for example) rather than generating novel views from 3D renders. The latter would require an immense dataset of 3D scenes and objects (Objaverse or something similar) to prevent overfitting, and even then the model could overfit to the actual visual content of the 3D scenes rather than the camera angles you want it to be conditioned on (remember that the diffusion model has absolutely no idea what you are trying to make it learn; it simply learns whatever patterns in the data are most prominent).
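As a purely illustrative sketch of that captioning route (the `estimate_camera_attributes` helper below is a made-up placeholder; in practice it would be a VLM prompt or a monocular pose/horizon estimator):

```python
from PIL import Image

def estimate_camera_attributes(image: Image.Image) -> dict:
    # Hypothetical stub: a real implementation would come from a VLM prompt
    # ("is the camera above or below the subject, how far away, ...") or a
    # monocular pose / horizon-line estimator. Values are normalized to [0, 1].
    return {"rotation": 0.5, "height": 0.5, "distance": 0.3}

def augment_caption(caption: str, image: Image.Image) -> str:
    # Append structured camera tokens to an existing caption so a fine-tune
    # on the captioned dataset can pick them up.
    attrs = estimate_camera_attributes(image)
    tokens = ", ".join(f"{name}-{value:.1f}" for name, value in sorted(attrs.items()))
    return f"{caption}, {tokens}"

# e.g. augment_caption("a grocery store interior", img)
#   -> "a grocery store interior, distance-0.3, height-0.5, rotation-0.5"
```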
That is very interesting as a dataset generation technique, but I am curious how to ensure there is enough visual variety in the dataset for the model to still maintain image generation quality across different prompts, since the underrepresented images will have significantly lower quality due to the model trying to generalize from a small sample.
It would be cool to see a large captioned dataset paired with structured outputs for the attributes you would want to control (like the ones you mentioned earlier: camera_z_123, sunlit, sun_distance_25, etc.). The tricky part is still generating enough 3D scenes and avoiding image degradation, since diffusion models trained on generated outputs tend to degrade in image quality. It sounds like a promising approach, however.
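Purely as an illustration of what I mean by a structured output (field names, paths, and value ranges are made up), each record could pair the image with both a free-text caption and a machine-readable attribute block:

```python
import json

# One dataset record: an image, its free-text caption, and a structured
# attribute block describing the camera and lighting setup.
record = {
    "image": "renders/00042.png",
    "caption": "grocery store, shop, vector graphic",
    "attributes": {
        "rotation": 0.5,       # camera yaw around the scene, normalized
        "height": 0.5,         # camera height, normalized
        "distance": 0.3,       # camera distance from the subject, normalized
        "sun_rotation": 0.2,
        "sun_height": 0.5,
        "sun_distance": 1.0,
    },
}

with open("metadata.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```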
