[Question] Camera Position Adapter?
arthurwolf opened this issue
Hello!
Thanks for the amazing project.
I'm often in the situation where I've generated a scene I really like, but I'd like to rotate the camera a bit more to the right, or zoom in, or put the camera a bit higher up, etc.
Currently, the only way I've found to do this would be to generate a 3D model of the scene (possibly automatically from a controlnet-generated depth map?), rotate that, generate a new depth map, and use that to regenerate the image.
But:

- This is cumbersome/slow, and
- This would only let me move the camera small amounts at a time.
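To make that roundtrip concrete, here's roughly what I mean, using off-the-shelf pieces (a DPT depth estimator plus a depth ControlNet via diffusers). The focal length, the rotation math, and treating the predicted map as plain linear depth are all guesses/simplifications on my part:

```python
import numpy as np
import torch
from PIL import Image
from transformers import pipeline
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# 1. Estimate depth from the image I already like.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth = np.array(depth_estimator(Image.open("scene.png"))["depth"], dtype=np.float32)

# 2. Unproject to a point cloud with an assumed pinhole camera, then rotate it.
#    (Treating the map as linear depth is a simplification; DPT-style models
#    actually predict relative inverse depth.)
h, w = depth.shape
f = 0.8 * w  # guessed focal length in pixels
ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
z = depth + 1.0  # keep depth strictly positive
x = (xs - w / 2) * z / f
y = (ys - h / 2) * z / f
pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)

a = np.radians(10)  # rotate the viewpoint ~10 degrees around the vertical axis
R = np.array([[ np.cos(a), 0, np.sin(a)],
              [ 0,         1, 0        ],
              [-np.sin(a), 0, np.cos(a)]], dtype=np.float32)
pts = pts @ R.T

# 3. Reproject to a new depth map (nearest point wins; disocclusion holes get the far plane).
new_depth = np.full((h, w), np.inf, dtype=np.float32)
u = (pts[:, 0] * f / pts[:, 2] + w / 2).astype(int)
v = (pts[:, 1] * f / pts[:, 2] + h / 2).astype(int)
ok = (pts[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
np.minimum.at(new_depth, (v[ok], u[ok]), pts[ok, 2])
new_depth[np.isinf(new_depth)] = new_depth[~np.isinf(new_depth)].max()

# 4. Regenerate the scene from the rotated depth map with a depth ControlNet.
ctrl = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=ctrl, torch_dtype=torch.float16
).to("cuda")
d = (255 * (1 - new_depth / new_depth.max())).astype(np.uint8)  # near = bright
cond = Image.fromarray(np.stack([d, d, d], axis=-1))
pipe("grocery store, shop, vector graphic", image=cond).images[0].save("scene_rotated.png")
```

The disocclusion holes and the error that accumulates on every roundtrip are exactly why this only tolerates small camera moves.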
Another option somebody suggested was training LoRAs on specific angles, and having many LoRAs for many different angles/camera positions. But again, that's pretty cumbersome (and a lot of training), and I'm not even sure it would work.
Or train a single LoRA, but with a dataset that matches many different angle "keywords" to many differently positioned images? As you can see, I'm a bit lost.
I figured that what I really want to do is manipulate the part of the model's "internal conception" of the scene that defines its rotation (if there is such a thing...). There has to be some set of weights that determines whether we look at a subject from the front or the back, whether a face is seen from the side or at three quarters, etc.
So my question is, would it be possible to create a controlnet/adapter that would do this?
The main problem I see is that controlnet/adapter training, as far as I know, takes those black-and-white control images as input.
But the input in my case wouldn't be an image; it would be an angle.
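To make the question concrete, something like this is the shape I have in mind: a small MLP (all names and sizes made up by me) that turns the camera numbers into a few extra pseudo-tokens appended to the text embedding, so the UNet's cross-attention can condition on them. No idea if this is the right architecture for it:

```python
import torch
import torch.nn as nn

class CameraAdapter(nn.Module):
    """Hypothetical adapter: map (rotation, height, distance) to a few pseudo-tokens
    concatenated onto the text encoder output, so cross-attention can see them."""
    def __init__(self, n_params: int = 3, n_tokens: int = 4, dim: int = 768):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.mlp = nn.Sequential(
            nn.Linear(n_params, 256),
            nn.SiLU(),
            nn.Linear(256, n_tokens * dim),
        )

    def forward(self, camera: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # camera: (batch, 3) normalized values; text_embeds: (batch, 77, 768) from CLIP
        tokens = self.mlp(camera).view(-1, self.n_tokens, self.dim)
        return torch.cat([text_embeds, tokens], dim=1)

# e.g.: cond = CameraAdapter()(torch.tensor([[0.5, 0.5, 0.3]]), text_embeds)
# then pass cond to the UNet in place of the plain text embedding
```

I'd imagine only the adapter gets trained, with the base model frozen, the way the existing adapter approaches do it (as I understand them).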
So my best guess at how to do this (and it's likely completely wrong) would be:
- Take a 3D scene.
- Render it at a specific angle/zoom/camera position.
- Take that generated image, and a text description of the camera position: angle212 height1.54, etc. Or maybe (angle 0.3) (height 0.25)? I.e., play on the strength of the tokens? Something like that.
- Add each pair of generated image and corresponding position text to the dataset (completely ignoring the "black and white" input image).
- Generate thousands, train.
For example, a full caption might look like: `grocery store, shop, vector graphic, rotation-0.5, height-0.5, distance-0.3, sunrotation-0.2, sunheight-0.5, sundistance-1.0`
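For the data-generation step, here's roughly what I'd script in Blender. The paths, ranges, and caption format are just my guesses, I've left out the sun parameters for brevity (they'd just be more random values plus moving a light), and I'm assuming the camera has a "Track To" constraint targeting the subject so it always looks at the scene:

```python
import bpy, json, math, random

scene = bpy.context.scene
cam = scene.camera  # assumed to have a "Track To" constraint aimed at the subject

with open("/tmp/dataset/metadata.jsonl", "w") as meta:
    for i in range(1000):
        rot = random.random()            # normalized azimuth, 0..1
        height = random.random()         # normalized elevation, 0..1
        dist = random.uniform(0.2, 1.0)  # normalized distance

        # Place the camera on a sphere around the origin (subject assumed at origin).
        r = 2.0 + 8.0 * dist
        theta = rot * 2.0 * math.pi
        phi = (0.1 + 0.8 * height) * math.pi / 2.0
        cam.location = (r * math.cos(theta) * math.cos(phi),
                        r * math.sin(theta) * math.cos(phi),
                        r * math.sin(phi))

        scene.render.filepath = f"/tmp/dataset/{i:05d}.png"
        bpy.ops.render.render(write_still=True)

        caption = (f"grocery store, shop, vector graphic, "
                   f"rotation-{rot:.1f}, height-{height:.1f}, distance-{dist:.1f}")
        # file_name/text is the metadata.jsonl convention Hugging Face imagefolder datasets use
        meta.write(json.dumps({"file_name": f"{i:05d}.png", "text": caption}) + "\n")
```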
Would this work? Does it have any chance of working? If not, is there any way to do this that would?
Would a single 3D scene (or even a dumb cube on a plane) work, or do I need a large variety of scenes?
I would love some kind of input/feedback/advice on this.
Thanks so much to anyone who takes the time to reply.
Cheers.