omlins/ParallelStencil.jl

Disable subnormals inside @parallel blocks

smartalecH opened this issue · 8 comments

I'm able to disable subnormals (e.g. set_zero_subnormals(true)) within @parallel_indices blocks, but not within @parallel blocks. Is there an easy way to get around this? Thanks!

Hi @smartalecH, the parallel function only allows for FiniteDifference{1,2,3}D submodule macros by design to ensure good performance. More advanced features and constructs should currently be implemented in parallel_indices functions as you report.

Good to know, thanks!

Hi @smartalecH, here are some additional comments on @luraess's answer.

First, the error message you obtain when you try to add the line set_zero_subnormals(true) into your kernel gives some more information about why it does not work:

ERROR: LoadError: ArgumentError: unsupported kernel statements in @parallel kernel definition: @parallel is only applicable to kernels that contain exclusively array assignments using macros from FiniteDifferences{1|2|3}D or from another compatible computation submodule. @parallel_indices supports any kind of statements in the kernels.
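For reference, the workaround mentioned in the issue could look roughly like the sketch below: the kernel is written with @parallel_indices, which accepts arbitrary statements, and set_zero_subnormals(true) is called inside it. The Threads backend, the Float32 precision, the kernel name diffusion_step!, the array sizes, and the coefficient are all assumptions made for illustration and are not taken from the issue. According to the Julia documentation, set_zero_subnormals only affects the current thread, which is presumably why the call is placed inside the kernel rather than before the launch.

```julia
using ParallelStencil
@init_parallel_stencil(Threads, Float32, 2)   # assumed backend, precision, and dimensionality

# Hypothetical 2D diffusion kernel written with explicit indices.
@parallel_indices (ix, iy) function diffusion_step!(T2, T, c)
    # Arbitrary statements are allowed here, unlike in a plain @parallel kernel.
    set_zero_subnormals(true)
    if 1 < ix < size(T, 1) && 1 < iy < size(T, 2)
        T2[ix, iy] = T[ix, iy] + c * (T[ix-1, iy] + T[ix+1, iy] +
                                      T[ix, iy-1] + T[ix, iy+1] - 4 * T[ix, iy])
    end
    return
end

T  = @zeros(64, 64)
T2 = @zeros(64, 64)
T[32, 32] = 1                                  # single hot cell in the middle
@parallel diffusion_step!(T2, T, 0.1f0)        # ranges are derived from the array arguments
```

Calling set_zero_subnormals(true) for every index is redundant work; it is kept here only because it is the simplest way to reach every worker thread from inside the kernel.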

Second, note that set_zero_subnormals(true) can only give a performance improvement for a compute-bound code, whereas stencil codes are normally memory-bound. The heat flow code given in the performance tips in the Julia docs only benefits from set_zero_subnormals(true) when the problem is small enough to fit in a fast cache (1000 Float32 values, i.e., ~4 KB in the example code; if you increase the size of a in the example, e.g., to 1000^2, you should see that set_zero_subnormals(true) no longer has any effect).
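The cache-size argument above can be checked with a small experiment in plain Julia. The sketch below is modeled on the heat flow example from the Julia performance tips, rewritten from memory rather than copied; the helper compare and the reduced step count for the large array are assumptions added to keep the run short. No timings are quoted here because they depend on the hardware.

```julia
# One explicit finite-difference step of 1D heat flow: b is computed from a.
function timestep!(b::Vector{T}, a::Vector{T}, Δt::T) where {T}
    @assert length(a) == length(b)
    n = length(b)
    b[1] = 1                                   # boundary condition
    for i in 2:n-1
        b[i] = a[i] + (a[i-1] - T(2) * a[i] + a[i+1]) * Δt
    end
    b[n] = 0                                   # boundary condition
    return b
end

# Advance a by nstep steps (nstep is assumed to be even).
function heatflow!(a::Vector{T}, nstep::Integer) where {T}
    b = similar(a)
    for _ in 1:div(nstep, 2)
        timestep!(b, a, T(0.1))
        timestep!(a, b, T(0.1))
    end
    return a
end

# Hypothetical helper: time the same run with strict IEEE subnormals (false)
# and with subnormals flushed to zero (true).
function compare(n; nstep = 1000)
    heatflow!(zeros(Float32, 10), 2)           # force compilation before timing
    times = Dict{Bool,Float64}()
    for flush in (false, true)
        set_zero_subnormals(flush)
        a = zeros(Float32, n)
        times[flush] = @elapsed heatflow!(a, nstep)
    end
    set_zero_subnormals(false)                 # restore strict IEEE behaviour
    return times
end

small = compare(1_000)                         # ~4 KB per array, fits in a fast cache
large = compare(1_000_000; nstep = 100)        # ~4 MB per array; fewer steps to keep it short
println("n = 1000:   strict = ", small[false], " s, flushed = ", small[true], " s")
println("n = 1000^2: strict = ", large[false], " s, flushed = ", large[true], " s")
```

Running it on the target machine shows whether the flushed run is faster for the small array and whether that gap disappears for the large one.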

Third, note also that set_zero_subnormals does not work for GPU.

Finally, for improving CPU performance for small problems, we have initiated a backend with LoopVectorization. It might well be that within this effort, we can add support for set_zero_subnormals, one way or another.

Awesome, thanks for the very thorough followup. I have a few responses inline (if interested).

> Second, note that set_zero_subnormals(true) can only give a performance improvement for a compute-bound code,

I would disagree... I guess it depends on how you quantify a performance improvement. We have an FDTD code that benefited significantly from removing subnormal support. The issue was that as the fields ramped up, each step would take an extremely long time to compute. This was problematic when trying to fine-tune simulation parameters on different hardware setups (and you only wanted to run for a few timesteps anyway).
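One plausible reading of the ramp-up behaviour described above is that field values which are still very close to zero fall into the subnormal range, where arithmetic can be much slower on some hardware. The sketch below only illustrates when a value becomes subnormal; the function steps_until_subnormal and the halving factor are made up for the illustration and are not part of the FDTD code mentioned in the comment.

```julia
# Count how many halvings it takes for a value of 1 to become subnormal.
function steps_until_subnormal(::Type{T}) where {T<:AbstractFloat}
    x = one(T)
    steps = 0
    while !issubnormal(x) && x != 0
        x *= T(0.5)                            # stand-in for a field that is still tiny
        steps += 1
    end
    return steps
end

println(steps_until_subnormal(Float32))        # floatmin(Float32) is 2^-126
println(steps_until_subnormal(Float64))        # floatmin(Float64) is 2^-1022
```

Anything below floatmin(T) in magnitude, other than zero itself, is subnormal; with set_zero_subnormals(true) the hardware is permitted to treat such values as zero instead.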

Either way I agree that this is a rather niche improvement.

> Finally, for improving CPU performance for small problems, we have initiated a backend with LoopVectorization. It might well be that within this effort, we can add support for set_zero_subnormals, one way or another.

Cool! This sounds intriguing. Do you guys have a PR yet that describes the proposed implementation? I'm eager to help if interested.

I have done some prototyping with LoopVectorization, but there is nothing concrete in terms of integration of this yet. Also, first some refactoring for streamlining backend addition is needed, and the addition of an AMDGPU backend has higher priority at the moment. So, it will take a while until we can make it happen...

> first some refactoring for streamlining backend addition

Is this goal documented anywhere? I wanted to add Metal support. But I noticed that the backend processing is currently a bit ad hoc, and I also thought some refactoring would be good.

@smartalecH: sorry for the late reply, I have been on vacation. No, this is not documented, and the new backend implementations will follow different requirements and ideology than the original one. We will let you know as soon as it is done.

Thanks @omlins! Feel free to reach out if I can help in some aspect.