BachiLi/redner

[Windows] Backpropagation does not work

mworchel opened this issue · 23 comments

When running the pose_estimation sample under Windows, the optimization part does not actually perform any optimization. The loss seems to vary randomly and the final estimate does not visually differ from the initial one:

[Image: redner_backprop]

For whatever reason, the parameter updates do not seem to be computed correctly (or they are far too small, since there is no visual difference between iterations).

System:

  • Windows 10 x64
  • Python 3.7.4
  • Redner 0.3.2 (CPU)

I tested the pose_estimation sample on Colab in both CPU and GPU mode. The results are very different:

CPU:
[Image: redner_colab_cpu]

GPU:
[Image: redner_colab_gpu]

The CPU mode seems to have some convergence issues in general. Maybe deep down it's somehow related to the Windows backprop issue mentioned above.

@BachiLi Any idea what could be causing the discrepancy?

Interesting. Looking into this.

CPU mode runs fine on my Mac... I'm really confused.

It also runs fine on my linux machine. This seems like a Colab-specific issue?

This issue exists on Colab for all redner versions I tested. I have no idea why there is a discrepancy between Colab and my Linux machine.

This issue also exists on the TensorFlow side. Actually, the TensorFlow version occasionally crashes on Colab.

A typical case of 'but it runs on my machine' :D That is really strange. Maybe it has something to do with the CPU type?

Yes, something is wrong on Colab.

I have a deadline next week and have to work on something else now. Please let me know if you find anything suspicious.

I tested the pose estimation with my (custom) Windows GPU branch, which is currently based on redner 0.2.3, and the backpropagation works without issues:

[Image: redner_win_gpu]

However, the CPU mode fails as above. So it really seems to be some CPU-related issue that has existed at least since 0.2.3.

I'll keep my eyes open. Good luck with your deadline for now!

@mworchel This should be fixed by the commit above (20af170). This is, unsurprisingly, caused by access to uninitialized buffers. In particular the code didn't consider the case where max_bounces=0. Thanks a lot for reporting this and please let me know if this fixes the problem on your side.
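Illustratively (the names below are hypothetical, not the actual redner code), the failure mode has this shape: a gradient buffer that is only written inside the bounce loop never receives defined values when max_bounces = 0, so whatever happens to be in memory is handed back as the "gradient":

```cpp
#include <cstring>

// Hypothetical sketch, not the actual redner code.
void backward(float *d_params, int num_params, int max_bounces) {
    // The missing step: without this, max_bounces == 0 returns whatever
    // garbage the allocation happened to contain.
    std::memset(d_params, 0, num_params * sizeof(float));

    for (int b = 0; b < max_bounces; b++) {
        // ... scatter the gradient contributions of bounce b into d_params ...
    }
}
```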

That's good news. At least in Colab it seems to work now. However, for Windows it still doesn't work. Same behavior on the CPU as before.

There is a small chance that it's still due to some difference in initialization behavior between MSVC and GCC/Clang. I also just discovered that I didn't properly port one of the compiler intrinsics: my version of ffs gives the index of the highest set bit (MSB) instead of the lowest (LSB). I fixed that in PR #104 (and double-checked against https://github.com/nemequ/portable-snippets/tree/master/builtin). However, this doesn't fix the backpropagation issue either.

Is there a way to verify the integrity of the edge tree, or some other way to check that the code relying on intrinsics behaves the same on all systems?
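For reference, a minimal check of the expected ffs convention could look something like the sketch below. It follows GCC/Clang's __builtin_ffs semantics and is not an existing redner test; ffs_portable is just an illustrative name:

```cpp
#include <cassert>

// Intended ffs semantics: 1-based index of the LOWEST set bit, 0 for input 0
// (mirroring GCC/Clang's __builtin_ffs). Illustrative reference only.
int ffs_portable(unsigned int x) {
    if (x == 0) return 0;
    int i = 1;
    while ((x & 1u) == 0) {
        x >>= 1;
        i++;
    }
    return i;
}

int main() {
    assert(ffs_portable(0u) == 0);
    assert(ffs_portable(0b0001u) == 1);
    assert(ffs_portable(0b1000u) == 4);
    // An MSB-based implementation would return 8 here instead of 2:
    assert(ffs_portable(0b10000010u) == 2);
    return 0;
}
```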

Hmm. Maybe we can set up some unit tests for the intrinsic. I can potentially do it this weekend. Thank you so much for your time by the way.

@mworchel Have you checked whether the atomics are working properly on Windows?

Unfortunately, I didn't fully verify that the atomics work. The file atomic_msvc.h is taken from a third-party repo (as noted in the comment above the header), and the functions looked reasonable to me. Are the atomic operations only required by the backward pass?

My hope is that the CPU path is a bit easier to debug. It would be great if we had some basic tests. Like I said before, it's a great research project, and I'm happy I can contribute something :)
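One cheap smoke test, independent of the renderer (a sketch, not an existing redner test; the atomic_add below is a known-good std::atomic-based stand-in for the implementation under test from atomic_msvc.h): add to a single float from many threads and compare against the expected total. Lost updates from a broken atomic add show up immediately.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the implementation under test (e.g. the Windows port in
// atomic_msvc.h). This std::atomic-based version is known-good; the point
// of the check is to swap in the suspect one.
float atomic_add(std::atomic<float> &target, float value) {
    float old_val = target.load();
    // compare_exchange_weak refreshes old_val on failure, so just retry.
    while (!target.compare_exchange_weak(old_val, old_val + value)) {}
    return old_val;
}

int main() {
    std::atomic<float> sum{0.f};
    const int num_threads = 8;
    const int adds_per_thread = 100000;

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&] {
            for (int i = 0; i < adds_per_thread; i++) {
                atomic_add(sum, 1.f);
            }
        });
    }
    for (auto &w : workers) w.join();

    // 800000 is exactly representable as a float, so the comparison is exact.
    const float expected = float(num_threads) * adds_per_thread;
    std::printf("sum = %.1f, expected = %.1f\n", sum.load(), expected);
    return sum.load() == expected ? 0 : 1;
}
```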

Windows machines should be able to pip install redner-gpu and pip install redner from now on. TensorFlow is not supported yet due to some compilation issues (MSVC is not happy with the TensorFlow headers).

Great news! Did you find any new evidence why the CPU backprop didn't work? I suppose most people use the GPU version anyway, so it's not that urgent.

Huh, I thought that was resolved. I'll test on CPU later.

Unfortunately not. Your initialization fix resolved it for Colab but not for Windows. I wasn't able to track the issue down any further. I hope you'll find something.

Pretty sure there is a bug in the atomic add on the Windows CPU path. Should be fixed soon.

0.4.3 should fix this. It was a type conversion issue in the atomics: InterlockedCompareExchange takes integers as arguments, and we passed floats to it without a reinterpret_cast.
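For reference, the corrected pattern is roughly the usual CAS loop with the floats reinterpreted bit-for-bit as 32-bit integers (a sketch, not redner's exact code): passing the floats directly makes the compiler insert a numeric float-to-LONG conversion, so the compare-and-swap operates on the wrong values and updates get lost.

```cpp
#include <cstring>
#include <windows.h>

// Sketch of a float atomic add built on the Win32 32-bit CAS primitive.
static inline float atomic_add(volatile float *target, float value) {
    LONG old_bits, new_bits;
    float old_val, new_val;
    do {
        old_val = *target;
        new_val = old_val + value;
        // Reinterpret the float bit patterns as integers -- no numeric
        // conversion must happen here.
        std::memcpy(&old_bits, &old_val, sizeof(old_bits));
        std::memcpy(&new_bits, &new_val, sizeof(new_bits));
    } while (InterlockedCompareExchange(
                 reinterpret_cast<volatile LONG *>(target),
                 new_bits, old_bits) != old_bits);
    return old_val;
}
```

The fix described above adds the missing reinterpret_cast; std::memcpy (or std::bit_cast in C++20) is an equivalent, strictly well-defined way to spell the same bit reinterpretation.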

Nice catch! Will try it as soon as I can.

Indeed fixed in 0.4.3. Thanks.