BachiLi/redner

[Windows] Backpropagation does not work

mworchel opened this issue · 23 comments

When running the pose_estimation sample under Windows, the optimization part does not actually perform any optimization. The loss seems to vary randomly and the final estimate does not visually differ from the initial one:

[Image: redner_backprop]

For whatever reason, the parameter updates do not seem to be computed correctly (or they are far too small, since there is no visual difference between iterations).

System:

  • Windows 10 x64
  • Python 3.7.4
  • Redner 0.3.2 (CPU)

I tested the pose_estimation sample on Colab in both CPU and GPU mode. The results are very different:

CPU:
[Image: redner_colab_cpu]

GPU:
[Image: redner_colab_gpu]

The CPU mode seems to have some convergence issues in general. Maybe deep down it's somehow related to the Windows backprop issue mentioned above.

@BachiLi Any idea what could be causing the discrepancy?

Interesting. Looking into this.

CPU mode runs fine on my Mac... I'm really confused.

It also runs fine on my linux machine. This seems like a Colab-specific issue?

This issue exists on Colab for all redner versions I tested. I have no idea why there is a discrepancy between Colab and my Linux machine.

This issue also exists on the TensorFlow side. Actually, the TensorFlow version occasionally crashes on Colab.

A typical case of 'but it runs on my machine' :D That is really strange. Maybe it has something to do with the CPU type?

Yes, something is wrong on Colab.

I have a deadline next week and have to work on something else now. Please let me know if you find anything suspicious.

I tested the pose estimation with my (custom) Windows GPU branch, which is currently based on redner 0.2.3, and the backpropagation works without issues:

[Image: redner_win_gpu]

However, the CPU mode fails as above. So it really seems to be some CPU-related issue that has existed at least since 0.2.3.

I'll keep my eyes open. Good luck with your deadline for now!

@mworchel This should be fixed by the commit above (20af170). This is, unsurprisingly, caused by access to uninitialized buffers. In particular the code didn't consider the case where max_bounces=0. Thanks a lot for reporting this and please let me know if this fixes the problem on your side.
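Illustratively (the names below are hypothetical, not the actual redner code), the failure mode has this shape: a gradient buffer that is only written inside the bounce loop never receives defined values when max_bounces = 0, so whatever happens to be in memory is handed back as the "gradient":

```cpp
#include <cstring>

// Hypothetical sketch, not the actual redner code.
void backward(float *d_params, int num_params, int max_bounces) {
    // The missing step: without this, max_bounces == 0 returns whatever
    // garbage the allocation happened to contain.
    std::memset(d_params, 0, num_params * sizeof(float));

    for (int b = 0; b < max_bounces; b++) {
        // ... scatter the gradient contributions of bounce b into d_params ...
    }
}
```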

That's good news. At least in Colab it seems to work now. However, for Windows it still doesn't work. Same behavior on the CPU as before.

There is a small chance that it's still due to some difference in initialization behavior between MSVC and GCC/Clang. I also just discovered that I didn't properly port one of the compiler intrinsics: my version of ffs gives the index of the highest set bit (MSB) instead of the lowest (LSB). I fixed that in PR #104 (and double-checked against https://github.com/nemequ/portable-snippets/tree/master/builtin). However, this doesn't fix the backpropagation issue either.

Is there a way to verify the integrity of the edge tree, or some other way to check that the code relying on intrinsics behaves the same on all systems?
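For reference, a minimal check of the expected ffs convention could look something like the sketch below. It follows GCC/Clang's __builtin_ffs semantics and is not an existing redner test; ffs_portable is just an illustrative name:

```cpp
#include <cassert>

// Intended ffs semantics: 1-based index of the LOWEST set bit, 0 for input 0
// (mirroring GCC/Clang's __builtin_ffs). Illustrative reference only.
int ffs_portable(unsigned int x) {
    if (x == 0) return 0;
    int i = 1;
    while ((x & 1u) == 0) {
        x >>= 1;
        i++;
    }
    return i;
}

int main() {
    assert(ffs_portable(0u) == 0);
    assert(ffs_portable(0b0001u) == 1);
    assert(ffs_portable(0b1000u) == 4);
    // An MSB-based implementation would return 8 here instead of 2:
    assert(ffs_portable(0b10000010u) == 2);
    return 0;
}
```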

Hmm. Maybe we can set up some unit tests for the intrinsic. I can potentially do it this weekend. Thank you so much for your time by the way.

@mworchel Have you checked whether the atomics are working properly on Windows?

Unfortunately, I didn't fully verify that the atomics work. The file atomic_msvc.h is taken from a third-party repo (as noted in the comment above the header), and the functions looked reasonable to me. Are the atomic operations only required by the backward pass?

My hope is that the CPU path is a bit easier to debug. It would be great if we had some basic tests. Like I said before, it's a great research project, and I'm happy I can contribute something :)
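One cheap smoke test, independent of the renderer (a sketch, not an existing redner test; the atomic_add below is a known-good std::atomic-based stand-in for the implementation under test from atomic_msvc.h): add to a single float from many threads and compare against the expected total. Lost updates from a broken atomic add show up immediately.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the implementation under test (e.g. the Windows port in
// atomic_msvc.h). This std::atomic-based version is known-good; the point
// of the check is to swap in the suspect one.
float atomic_add(std::atomic<float> &target, float value) {
    float old_val = target.load();
    // compare_exchange_weak refreshes old_val on failure, so just retry.
    while (!target.compare_exchange_weak(old_val, old_val + value)) {}
    return old_val;
}

int main() {
    std::atomic<float> sum{0.f};
    const int num_threads = 8;
    const int adds_per_thread = 100000;

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&] {
            for (int i = 0; i < adds_per_thread; i++) {
                atomic_add(sum, 1.f);
            }
        });
    }
    for (auto &w : workers) w.join();

    // 800000 is exactly representable as a float, so the comparison is exact.
    const float expected = float(num_threads) * adds_per_thread;
    std::printf("sum = %.1f, expected = %.1f\n", sum.load(), expected);
    return sum.load() == expected ? 0 : 1;
}
```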

Windows machines should be able to pip install redner-gpu and pip install redner from now on. TensorFlow is not supported yet due to some compilation issues (MSVC is not happy with the TensorFlow headers).

Great news! Did you find any new evidence why the CPU backprop didn't work? I suppose most people use the GPU version anyway, so it's not that urgent.

Huh, I thought that was resolved. I'll test on CPU later.

Unfortunately not. Your initialization fix resolved it for Colab but not for Windows. I wasn't able to track the issue down any further. I hope you'll find something.

Pretty sure there is a bug in the atomic add on the Windows CPU path. Should be fixed soon.

0.4.3 should fix this. It was a type conversion issue in the atomics: InterlockedCompareExchange takes integers as arguments, and we passed floats to it without a reinterpret_cast.
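For reference, the corrected pattern is roughly the usual CAS loop with the floats reinterpreted bit-for-bit as 32-bit integers (a sketch, not redner's exact code): passing the floats directly makes the compiler insert a numeric float-to-LONG conversion, so the compare-and-swap operates on the wrong values and updates get lost.

```cpp
#include <cstring>
#include <windows.h>

// Sketch of a float atomic add built on the Win32 32-bit CAS primitive.
static inline float atomic_add(volatile float *target, float value) {
    LONG old_bits, new_bits;
    float old_val, new_val;
    do {
        old_val = *target;
        new_val = old_val + value;
        // Reinterpret the float bit patterns as integers -- no numeric
        // conversion must happen here.
        std::memcpy(&old_bits, &old_val, sizeof(old_bits));
        std::memcpy(&new_bits, &new_val, sizeof(new_bits));
    } while (InterlockedCompareExchange(
                 reinterpret_cast<volatile LONG *>(target),
                 new_bits, old_bits) != old_bits);
    return old_val;
}
```

The fix described above adds the missing reinterpret_cast; std::memcpy (or std::bit_cast in C++20) is an equivalent, strictly well-defined way to spell the same bit reinterpretation.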

Nice catch! Will try it as soon as I can.

Indeed fixed in 0.4.3. Thanks.