TQTQliu/MVSGaussian

About depth estimation

Opened this issue · 5 comments

Hi, thank you for the excellent work!

I noticed that the final point cloud is generated from the depths of the 4 target views.
My concern is that, when generating a point cloud from depth, the natural choice would be to build it from the source/reference views rather than the target views. Building it from the target views feels a bit like knowing the exam scope in advance and only studying that part.

I wonder what the motivation is for building the point cloud from the target views here. Is it to break the restriction of fixed input views?
If I fix the input views and generate the point cloud from the source/reference views, will the model's generalization be affected? Or is generalization affected more by the one-to-one correspondence mentioned in the paper?

Thanks for your interest. Good question!

Our current model is set up to take a set of input images and then produce the Gaussian point cloud at the novel (target) views, so we chose to use the point cloud from the novel views as the initialization.
The other way, which is probably what you are describing, is that given a set of input images, the model predicts the Gaussian point cloud at the input views, and then renders novel views or uses that point cloud for subsequent fine-tuning.

Our current model can also be applied to this second setup fairly easily, although I think some adaptive improvements to the existing model would be worthwhile.

If you don't want to adjust the current model, but just want to get a point cloud of the input views with our pre-trained model and then do fine-tuning, I think that's fine.

Thanks for your reply. It helps a lot, but I am still a little confused. Let me explain my understanding first.
In the inference stage, COLMAP is used to compute the camera parameters and a point cloud, and then we run

python run.py --type evaluate --cfg_file configs/mvsgs/colmap_eval.yaml test_dataset.data_root examples/scene1

for rendering, and the point cloud generated from the predicted depth is saved for subsequent optimization; the fusion strategy is used here. In this step, are only the camera parameters used, while the point cloud computed by COLMAP is not? Is the output point cloud computed from the depth predicted by the network?
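
(For reference, this kind of depth-to-point-cloud conversion looks roughly like the sketch below: it uses only the camera parameters and the predicted depth, not COLMAP's points. This is a generic illustration with a hypothetical depth_to_points helper, not the repo's fusion.py.)

# Generic sketch (not the repo's fusion.py): back-project one predicted depth map
# into world-space points using only the camera parameters; the per-view clouds are
# then concatenated (and, in practice, consistency-filtered) into the initialization.
import numpy as np

def depth_to_points(depth, K, c2w):
    """depth: (H, W) predicted depth; K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                      # camera-space directions
    pts_cam = rays * depth[..., None]                                    # scale by depth
    ones = np.ones((H, W, 1))
    pts_world = (np.concatenate([pts_cam, ones], -1).reshape(-1, 4) @ c2w.T)[:, :3]
    return pts_world                                                     # (H*W, 3) world-space points

# Fusing the views is then essentially:
# cloud = np.concatenate([depth_to_points(d, K, c2w) for d, K, c2w in per_view_params])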

During training, COLMAP is used to initialize the point cloud.
Is only RGB used to supervise the model, with no depth supervision? So when we train the model from scratch, does the forward network use cost-volume-related features to predict the parameters of 3DGS, with no fusion.py involved?

Also, when I run
python train_net.py --cfg_file configs/mvsgs/colmap_eval.yaml train_dataset.data_root examples/scene1 test_dataset.data_root examples/scene1

I found that the network returns ret, which is a dict:

ret.update({key+f'_level{i}': ret_i[key] for key in ret_i})

However, in trainer.py there is
output, loss, loss_stats, image_stats = self.network(batch)

https://github.com/TQTQliu/MVSGaussian/blob/b9a0fdd822dffad73cdcefb5b2120865c1064109/lib/train/trainers/trainer.py#L55C13-L55C72

Which forward does self.network correspond to?

Our method has two characteristics: 1) a feed-forward network for novel view synthesis, which takes multi-view images and camera parameters as input and outputs novel views; 2) a point cloud initializer: the feed-forward network can output dense point clouds as the initialization for subsequent per-scene 3DGS optimization.

Here are the responses to your questions:

  1. For the per-scene optimization, we use the point cloud produced by the network instead of the point cloud from COLMAP, because the network's point cloud is denser and yields better results.

  2. During training, we do not use the point cloud from COLMAP. Our model is a feed-forward model: it takes multi-view images and their camera poses as inputs and outputs novel views, without the need for a point cloud.

  3. Only RGB is used to supervise the model; no depth supervision is used.

  4. When training the model from scratch, the forward network uses cost-volume features to predict the parameters of 3DGS and does not involve fusion.py, because the fusion step is not differentiable.

  5. For self.network, please refer to the return statement of its forward pass (a minimal sketch of this wrapper pattern is given after this list):

    return output, loss, scalar_stats, image_stats
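
To make the wrapper pattern above concrete, here is a minimal sketch of a loss-computing wrapper whose forward matches the two snippets quoted in this thread (self.network(batch) in the trainer and the return statement above). The class name NetworkWrapper, the batch/output key names, and the plain MSE loss are illustrative assumptions, not the actual MVSGaussian code.

import torch
import torch.nn as nn

class NetworkWrapper(nn.Module):              # hypothetical name, for illustration only
    def __init__(self, net):
        super().__init__()
        self.net = net                        # the multi-level renderer that returns the ret dict

    def forward(self, batch):
        output = self.net(batch)              # dict with per-level keys, e.g. 'rgb_level0', 'rgb_level1'
        scalar_stats, loss = {}, 0.0
        for key, pred in output.items():
            if key.startswith('rgb'):         # RGB-only supervision (see answer 3 above)
                mse = torch.mean((pred - batch['rgb']) ** 2)
                scalar_stats[key + '_mse'] = mse
                loss = loss + mse
        scalar_stats['loss'] = loss
        image_stats = {}
        return output, loss, scalar_stats, image_stats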

OK, thanks. This really solved my confusion.

Hi, I share this concern.

Has anyone tested the performance of the 3DGS output from the source views?

If this works, it could be very promising, since it would avoid the need for per-view predictions.

Additionally, I'm curious why 3D points were used as the initialization instead of 3D Gaussians, which carry more information (e.g., scaling, rotation). I am looking forward to your reply, @TQTQliu!

Thanks!
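
(For context on the last question: in vanilla 3DGS, initializing from 3D points means only the positions and colors come from the point cloud, while scales, rotations, and opacities start from simple heuristics. A generic sketch of that conversion, with a hypothetical init_gaussians_from_points helper, not MVSGaussian's code, is below.)

# Sketch of how vanilla 3DGS typically expands a point cloud into initial Gaussians:
# positions and colors come from the points, everything else from simple heuristics.
import numpy as np

def init_gaussians_from_points(xyz, rgb):
    """xyz: (N, 3) points; rgb: (N, 3) colors in [0, 1]."""
    n = xyz.shape[0]
    # scales: log of the mean distance to the 3 nearest neighbours (isotropic start)
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)   # O(N^2) distances; fine for a sketch
    np.fill_diagonal(d2, np.inf)
    knn = np.sqrt(np.sort(d2, axis=1)[:, :3]).mean(axis=1)
    scales = np.log(np.clip(knn, 1e-7, None))[:, None].repeat(3, axis=1)
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))          # identity quaternions
    opacities = np.full((n, 1), np.log(0.1 / 0.9))             # inverse sigmoid of 0.1
    sh_dc = (rgb - 0.5) / 0.28209479177387814                  # RGB -> zeroth-order SH coefficient
    return dict(xyz=xyz, scales=scales, rotations=rotations,
                opacities=opacities, sh_dc=sh_dc)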