jchibane/ndf

Reproducibility

toobaimt opened this issue · 8 comments

I am trying to generate the results that are uploaded under 'ndf/experiments/shapenet_cars_pretrained/evaluation/generation/02958343/' using the provided pre-trained model at the 41st epoch, but the results don't seem to match at all.
Can you kindly specify how many epochs the model was trained for before generating the given results?

Hi @toobaimt ,

thanks for reaching out.

A few things are not yet clear:
What do you mean by the results don't match - in what sense do they differ? What results do you obtain?
Are you reporting unsatisfactory results when using the pretrained model, or when training NDF yourself?

On my side, after installing the project (code is in place, conda environment is set up, data is in place), I can generate results with the pretrained model by running:

python generate.py --config configs/shapenet_cars_pretrained.txt

In fact, this is how I generated the example results in 'ndf/experiments/shapenet_cars_pretrained/evaluation/generation/02958343/'.
The pretrained model can be found in

ndf/experiments/shapenet_cars_pretrained/checkpoints/

and was trained for ~108 hours (if I remember correctly that was on a Tesla V100 GPU).

Best,
Julian

Hi Julian,

Thank you for the prompt response.

I used the provided pre-trained model (trained for 108 hours, 41 epochs) to generate dense point clouds for the same models that are provided in 'ndf/experiments/shapenet_cars_pretrained/evaluation/generation/02958343/', and they do not look alike. (Examples attached; on the left are snapshots of your example results, and on the right are the models generated with the provided pre-trained model.)
[Screenshots: provided example results (left) vs. results generated with the pre-trained model (right)]

I also tried training from scratch before the pre-trained model and example results were provided, and observed that NDF starts converging reasonably from around the 150th epoch onwards.

Thanks,
Tooba

Hi Tooba,

regarding the pre-trained model:

That looks very unexpected. Does this issue occur with every object you generate, or only with some?
Do you use the same command I suggested? Have you altered the code?
I recently updated the repo - maybe clone it again into a fresh directory and try from scratch. These issues shouldn't occur, as I used the same neural network to produce the provided outputs.

In case this doesn't help although you follow the README instructions exactly, I suspect there is some compatibility issue. In that case, please specify your setup (see the snippet after this list for a quick way to print the PyTorch/CUDA versions):
Your OS
Your Hardware (CPU/GPU)
GPU Driver (nvidia-smi / CUDA version)
Your Conda environment with the library versions
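
If it helps, here is a small helper snippet (not part of the repo) to print the relevant PyTorch/CUDA versions from within your environment:

import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))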

Please also copy & paste your generation command and its output.

When you train an NDF model yourself, is the generation giving you the correct outputs (i.e. without the artifacts you find when you run the pre-trained model)?

You can speed up the training of NDF by setting a higher learning rate (e.g. 1e-4). However, we found that this sometimes seems to cause issues with training convergence.
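
For reference, the generic pattern is simply passing the higher rate to the Adam optimizer wherever it is created in the training code (a sketch only, not the repo's exact interface):

import torch

model = torch.nn.Linear(3, 1)  # stand-in for the NDF network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # raised learning rate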

Best,
Julian

Hi Julian,

Thank you for the pointers. I was probably working with an older version of the repo, and I'm able to reproduce the results correctly now. Thanks!

Best,
Tooba

Wi-sc commented

Hi @toobaimt ,

I also ran into the same issue. However, I cloned the code from the main repo. Where did you find the updated repo?

[Screenshot attached]

Best,
Wisc

I also face the same issue after using the provided pretrained model.

I would also like to note that I was not able to run the source code with the default libraries (PyTorch 1.2 with CUDA 10.2) because it hangs when copying tensors from CPU to GPU; my GPU is an A6000, which does not seem to be supported by CUDA 10 anymore.
I ended up using PyTorch 1.10 with CUDA 11.3 and then ran into the artifacts that others reported.

I tried reducing filter_val at this line and it helps. But reducing it too much creates holes in the resulting point cloud.
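
To illustrate what reducing filter_val does, here is a minimal sketch of the idea (my own paraphrase, not the repo's exact generation code): candidate points are kept only if their predicted unsigned distance to the surface is below the threshold, so a smaller value removes floating artifacts but can also discard valid surface points and leave holes:

import torch

def filter_points(points, predicted_udf, filter_val):
    # points: (N, 3) candidate point cloud, predicted_udf: (N,) predicted distances
    keep = predicted_udf < filter_val
    return points[keep]

points = torch.rand(1000, 3)
udf = torch.rand(1000) * 0.02          # dummy predicted distances
dense_cloud = filter_points(points, udf, filter_val=0.005)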

Wi-sc commented

Hi @vhvkhoa

I found the bugs. Newer PyTorch versions changed the padding behaviour: Conv3d no longer accepts padding_mode='border', and the default of align_corners in F.grid_sample changed. You can try modifying all the affected calls in local_model.py.

Original code:

self.conv_in = nn.Conv3d(1, 16, 3, padding=1, padding_mode='border')
feature_0 = F.grid_sample(f_0, p, padding_mode='border')

Fixed:

self.conv_in = nn.Conv3d(1, 16, 3, padding=1, padding_mode='zeros')
feature_0 = F.grid_sample(f_0, p, padding_mode='border', align_corners=True)

All code on lines 19-29 and 106-112 should be replaced accordingly. Hope this helps.
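
For context, here is a minimal repro of the grid_sample part (not repo code): since PyTorch 1.3 the default is align_corners=False, so the older sampling behaviour that the pretrained weights were trained with has to be requested explicitly. As far as I can tell, older versions also silently fell back to zero padding for the invalid Conv3d padding_mode='border', which is why 'zeros' reproduces the original behaviour there.

import torch
import torch.nn.functional as F

f = torch.arange(8.).reshape(1, 1, 2, 2, 2)      # tiny 3D feature grid
p = torch.tensor([[[[[0.5, 0.5, 0.5]]]]])        # one query point in [-1, 1]^3

old_style = F.grid_sample(f, p, padding_mode='border', align_corners=True)
new_default = F.grid_sample(f, p, padding_mode='border', align_corners=False)
print(old_style.item(), new_default.item())      # 5.25 vs 7.0 - the outputs differ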