[Question] Training a camera position ControlNet?
arthurwolf opened this issue · 7 comments
Hello!
Thanks for the amazing project.
I'm often in the situation where I've generated a scene I really like, but I'd like to rotate the camera a bit more to the right, or zoom in, or put the camera a bit higher up, etc.
Currently, the only way I've found to do this would be to generate a 3D model of the scene (possibly automatically from a ControlNet-generated depth map?), rotate that, generate a new depth map, and use that to regenerate the image.
But:
- This is cumbersome/slow, and
- This would only let me move the camera small amounts at a time
Another option somebody suggested was training LoRAs on specific angles and having many LoRAs for many different angles/camera positions. But again, that's pretty cumbersome (and a lot of training), and I'm not even sure it would work.
Or train a single LoRA, but with a dataset that matches many different angle "keywords" to many differently positioned images? As you can see, I'm a bit lost.
I figured what I really want to do is manipulate the part of the model's "internal conception" of the scene that defines its rotation (if there is such a thing...). There has to be some set of weights that determines whether we look at a subject from the front or the back, whether a face is seen from the side or at three quarters, etc.
So my question is: would it be possible to create a ControlNet that does this?
The main problem I see is that ControlNet training, as described in
https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md
takes images as input.
But the input in my case wouldn't be an image, it would be an angle.
So my best guess at how to do this (and it's likely completely wrong) would be:
- Take a 3D scene.
- Render it at a specific angle/zoom/camera position.
- Take that rendered image and pair it with a text description of the camera position: angle212 height1.54 etc. Or maybe (angle 0.3) (height 0.25), i.e. play with the strength of the tokens? Something like that.
- Add each pair of rendered image and corresponding position text to the dataset (completely ignoring the "black and white" input image)
- Generate thousands, train.
An example caption might look like: grocery store, shop, vector graphic, rotation-0.5, height-0.5, distance-0.3, sunrotation-0.2, sunheight-0.5, sundistance-1.0
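To make that concrete, here's a rough sketch of what the render loop could look like, using Blender's `bpy` just as an example. It assumes the .blend file already contains the scene, a camera, and a sun lamp named "Sun"; the paths, ranges, and object names are all made up:

```python
import math
import os
import random

import bpy
from mathutils import Vector

scene = bpy.context.scene
cam = scene.camera                      # assumes the scene already has a camera
sun = bpy.data.objects.get("Sun")       # assumes a sun lamp named "Sun"

def orbit(obj, rotation, height, distance):
    # Place the object on a circle around the origin and aim it at the origin.
    angle = rotation * 2.0 * math.pi
    obj.location = Vector((distance * math.cos(angle),
                           distance * math.sin(angle),
                           height))
    direction = Vector((0.0, 0.0, 0.0)) - obj.location
    obj.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()

os.makedirs(bpy.path.abspath("//renders"), exist_ok=True)
captions = []

for i in range(10_000):
    rot, h, d = random.random(), random.random(), random.random()
    sun_rot, sun_h, sun_d = random.random(), random.random(), random.random()

    orbit(cam, rot, h * 5.0, 3.0 + d * 10.0)            # map [0,1] to scene units
    orbit(sun, sun_rot, sun_h * 20.0, 20.0 + sun_d * 30.0)

    scene.render.filepath = f"//renders/{i:05d}.png"
    bpy.ops.render.render(write_still=True)

    captions.append(
        f"grocery store, shop, vector graphic, "
        f"rotation-{rot:.1f}, height-{h:.1f}, distance-{d:.1f}, "
        f"sunrotation-{sun_rot:.1f}, sunheight-{sun_h:.1f}, sundistance-{sun_d:.1f}"
    )

with open(bpy.path.abspath("//captions.txt"), "w") as f:
    f.write("\n".join(captions))
```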
Would this work? Does it have any chance to work? If not, is there any way to do this that would work?
Would a single 3D scene (or even a dumb cube on a plane) work, or do I need a large variety of scenes?
I would love some kind of input/feedback/advice on this.
Thanks so much to anyone who takes the time to reply.
Cheers.
@arthurwolf This is most likely not possible with the ControlNet implementation in this repo, since it relies on converting the guidance into an image that the UNet can understand, and it doesn't really make sense to convert camera position data such as rotation and position into an image format like you mentioned. It would make more sense to generate an image and then control the camera position with a NeRF (neural radiance field) or an image-to-3D model like https://huggingface.co/stabilityai/TripoSR. Directly using the diffusion model to control camera angles would probably require some kind of text conditioning on camera position data to guide the model.
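Purely as an illustration of what that conditioning could look like (the `CameraEmbedder` module below is hypothetical, not part of this repo or any library): embed the camera parameters with a small MLP and append the result as extra tokens to the text-encoder output that the UNet cross-attends to.

```python
import torch
import torch.nn as nn

class CameraEmbedder(nn.Module):
    """Hypothetical module: maps (rotation, height, distance) to extra
    conditioning tokens appended to the text embeddings."""
    def __init__(self, num_params: int = 3, embed_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.embed_dim = embed_dim
        self.mlp = nn.Sequential(
            nn.Linear(num_params, 256),
            nn.SiLU(),
            nn.Linear(256, num_tokens * embed_dim),
        )

    def forward(self, camera_params: torch.Tensor) -> torch.Tensor:
        # camera_params: (batch, num_params), values normalized to [0, 1]
        tokens = self.mlp(camera_params)
        return tokens.view(-1, self.num_tokens, self.embed_dim)

# Usage sketch: concatenate the camera tokens with the text embeddings
# before they reach the UNet's cross-attention layers.
# text_embeds: (batch, 77, 768) from the text encoder
# camera_params: (batch, 3), e.g. [rotation, height, distance]
# cond = torch.cat([text_embeds, CameraEmbedder()(camera_params)], dim=1)
```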
That's similar to creating a dataset for a NeRF model, which is very challenging since it requires a massive amount of high-quality 3D renders (you don't want the diffusion model's image quality to degrade from conditioning on low-quality renders) with a variety of novel views for each. I think it would be easier to add the positions as a caption, produced by a VLM or a camera pose estimator (OpenCV and such), on an existing text-to-image dataset (LAION-2B for example) rather than generating novel views from 3D renders. The latter would require an immense dataset of 3D scenes and objects (Objaverse or something similar) to prevent overfitting, and even then the model could overfit to the actual visual content of the 3D scenes rather than the camera angles you want it to be conditioned on (remember that the diffusion model has absolutely no idea what you are trying to make it learn; it simply learns whatever patterns in the data are most prominent).
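As a purely illustrative sketch of that captioning route (the `estimate_camera_attributes` helper below is a made-up placeholder; in practice it would be a VLM prompt or a monocular pose/horizon estimator):

```python
from PIL import Image

def estimate_camera_attributes(image: Image.Image) -> dict:
    # Hypothetical stub: a real implementation would come from a VLM prompt
    # ("is the camera above or below the subject, how far away, ...") or a
    # monocular pose / horizon-line estimator. Values are normalized to [0, 1].
    return {"rotation": 0.5, "height": 0.5, "distance": 0.3}

def augment_caption(caption: str, image: Image.Image) -> str:
    # Append structured camera tokens to an existing caption so a fine-tune
    # on the captioned dataset can pick them up.
    attrs = estimate_camera_attributes(image)
    tokens = ", ".join(f"{name}-{value:.1f}" for name, value in sorted(attrs.items()))
    return f"{caption}, {tokens}"

# e.g. augment_caption("a grocery store interior", img)
#   -> "a grocery store interior, distance-0.3, height-0.5, rotation-0.5"
```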
That is very interesting as a dataset generation technique, but I am curious how to ensure there is enough visual variety in the dataset for the model to still maintain image generation quality across different prompts, since the underrepresented images will have significantly lower quality due to the model trying to generalize from a small sample.
It would be cool to see a large captioned dataset paired with structured outputs for the attributes you would want to control (like the ones you mentioned earlier: camera_z_123, sunlit, sun_distance_25, etc.). The tricky part is still generating enough 3D scenes and avoiding image degradation, since diffusion models trained on generated outputs tend to degrade in image quality. It sounds like a promising approach, however.
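Purely as an illustration of what I mean by a structured output (field names, paths, and value ranges are made up), each record could pair the image with both a free-text caption and a machine-readable attribute block:

```python
import json

# One dataset record: an image, its free-text caption, and a structured
# attribute block describing the camera and lighting setup.
record = {
    "image": "renders/00042.png",
    "caption": "grocery store, shop, vector graphic",
    "attributes": {
        "rotation": 0.5,       # camera yaw around the scene, normalized
        "height": 0.5,         # camera height, normalized
        "distance": 0.3,       # camera distance from the subject, normalized
        "sun_rotation": 0.2,
        "sun_height": 0.5,
        "sun_distance": 1.0,
    },
}

with open("metadata.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```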
