A survey of performance improvements
hikettei opened this issue · 0 comments
In terms of training time and memory usage, cl-waffe2 still has a lot of challenges. In fact, even when training a simple MLP, cl-waffe2 is about 1.5 times slower than the equivalent operations in PyTorch. However, this is because cl-waffe2 is a JIT-compilation-based framework and I only started this project a few months ago; it still has a large amount of optimization potential. The goal for the next term is to optimize training time, so here is a list of things to be optimized:
cl-waffe2 IR
Graph-level optimization is still insufficient. In particular, the number of MoveTensorNode instructions should be reduced.
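As a toy illustration of the kind of rewrite this calls for (this is NOT cl-waffe2's actual IR; the instruction format and the `elide-moves` function are invented for this sketch), a move whose destination is written by no other instruction can be elided by substituting its source:

```lisp
;; Toy illustration only: NOT cl-waffe2's real IR. Each instruction is
;; (op destination . sources). A (:move d s) whose destination d is
;; written by no other instruction can be removed by substituting s for
;; d in every source position. (A real framework must also check
;; aliasing and in-place mutation before doing this.)
(defun elide-one-move (ir)
  "Drop the first safely removable :move in IR, or return IR unchanged."
  (let ((move (find-if (lambda (inst)
                         (and (eq (first inst) :move)
                              ;; d must have exactly one writer: this move
                              (= 1 (count (second inst) ir :key #'second))))
                       ir)))
    (if (null move)
        ir
        (let ((d (second move)) (s (third move)))
          (mapcar (lambda (inst)
                    (list* (first inst) (second inst)
                           (substitute s d (cddr inst))))
                  (remove move ir :test #'eq))))))

(defun elide-moves (ir)
  "Repeatedly elide moves until a fixed point is reached."
  (let ((next (elide-one-move ir)))
    (if (eq next ir) ir (elide-moves next))))
```

For example, `((:move t1 x) (:add t2 t1 y) (:move t3 t2) (:mul out t3 t3))` reduces to `((:add t2 x y) (:mul out t2 t2))` — both copies eliminated.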
FuseOps
Support for this is still poor. In the future, I want to implement search-based instruction fusion: for example, users would define the sequence of IR to be replaced via a (defpath ...) macro, and the compiler would read it.
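Since (defpath ...) does not exist yet, here is a purely hypothetical sketch of what such a user-defined rewrite rule could look like; the macro syntax and the node names MulNode / AddNode / MulAddNode are all assumptions, not cl-waffe2 API:

```lisp
;; HYPOTHETICAL: none of this is implemented. A user declares a pattern
;; over the IR and the fused node it should compile to; the compiler
;; searches the instruction sequence for the pattern and replaces it.
(defpath fuse-mul-add
    ;; pattern: the result of a MulNode feeds straight into an AddNode
    ((MulNode  (x y)   -> tmp)
     (AddNode  (tmp z) -> out))
  ;; rewrite: a single fused multiply-add, saving one temporary tensor
  ((MulAddNode (x y z) -> out)))
```

Fusing multiply-add this way avoids materializing tmp at all, which also reduces the MoveTensorNode count mentioned above.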
The full use of SIMD Ops
Use SLEEF for vectorized implementations of math functions.
The full use of lparallel
The maximum speed-up can be achieved by keeping all data in SIMD registers and then parallelising across cores with lparallel.
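A minimal sketch of the lparallel side of this plan (lparallel, make-kernel, and pmap are the library's real API; how cl-waffe2 will actually chunk work across workers is an assumption):

```lisp
;; Sketch assuming Quicklisp is available to load lparallel.
(ql:quickload :lparallel :silent t)

;; One worker per core (4 here, hard-coded for the example).
(setf lparallel:*kernel* (lparallel:make-kernel 4))

;; Data-parallel axpy: out[i] = alpha*x[i] + y[i]. pmap splits the work
;; across the kernel's workers; a real implementation would hand each
;; worker a contiguous chunk so the inner loop can stay in SIMD
;; registers, rather than parallelising per element as shown here.
(defun parallel-axpy (alpha xs ys)
  (lparallel:pmap 'vector
                  (lambda (x y) (+ (* alpha x) y))
                  xs ys))
```

For example, `(parallel-axpy 2.0 #(1.0 2.0 3.0) #(10.0 10.0 10.0))` returns `#(12.0 14.0 16.0)`.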