Disabling OpenMP parallel pragma for CPU tensors causes performance regression
alextnewman opened this issue · 1 comment
alextnewman commented
Removing the OpenMP parallel pragma from tensor_cpu_inl.h caused a massive performance regression for us on Windows (MSVC 2013), Mac (Clang), and Linux (gcc): f225763
After reverting this commit locally, our training time improved by more than 20%. It would be very helpful to have an option or flag that enables OpenMP parallelization for this function, so that we don't have to maintain an internal fork.
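To illustrate the kind of opt-in we have in mind, here is a minimal sketch. It is not the library's actual code: the macro name `EXAMPLE_CPU_USE_OPENMP` and the function `MapElementwise` are hypothetical, made up for this example. The idea is that the pragma stays off by default and is compiled in only when the user defines the macro and the compiler has OpenMP enabled (compilers define `_OPENMP` in that case).

```cpp
// Hypothetical opt-in macro (an assumption for this sketch, not an existing
// option in the library). Off by default; define it to 1 to enable OpenMP.
#ifndef EXAMPLE_CPU_USE_OPENMP
#define EXAMPLE_CPU_USE_OPENMP 0
#endif

// Hypothetical element-wise map: dst[i] = op(src[i]) for i in [0, n).
// A signed loop index is used because OpenMP 2.0 (the version MSVC 2013
// implements) requires a signed integer loop variable.
template <typename Op>
void MapElementwise(float* dst, const float* src, long n, Op op) {
  // The pragma is emitted only when the user opted in AND the compiler
  // was invoked with OpenMP support; otherwise the loop runs serially.
#if EXAMPLE_CPU_USE_OPENMP && defined(_OPENMP)
  #pragma omp parallel for
#endif
  for (long i = 0; i < n; ++i) {
    dst[i] = op(src[i]);
  }
}
```

With something like this, the default build keeps the current serial behavior, and we could opt in with `-DEXAMPLE_CPU_USE_OPENMP=1 -fopenmp` on gcc or `/DEXAMPLE_CPU_USE_OPENMP=1 /openmp` on MSVC. Each iteration writes to a distinct `dst[i]`, so the loop is safe to parallelize as long as `op` has no shared mutable state.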