# intel_assignment

This source code was written for an Intel interview assignment.
## Running

```sh
sh run.sh
```

or

```sh
gcc -std=gnu99 -mavx2 -mfma -mfma4 -fopenmp \
    trace.c tensor.c im2col.c conv2d.c pooling.c relu.c main.c \
    -o main.o -lm \
    && ./main.o
```

or, with compiler optimizations enabled:

```sh
gcc -O3 -std=gnu99 -mavx2 -mfma -mfma4 -fopenmp \
    trace.c tensor.c im2col.c conv2d.c pooling.c relu.c main.c \
    -o main.o -lm \
    && ./main.o
```
## Tensor storage

In my design, every matrix corresponds to a one-dimensional array, as shown in Figure 1. Given the 3-D matrix with shape 3x4x5 on the left side, my program stores each element into memory along the channel direction, forming the sequential one-dimensional space on the right side of Figure 1. A 4-D matrix is stored in almost the same way as a 3-D matrix, as shown in Figure 2. I have put the index on each element square, which I hope helps with understanding.
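As a concrete example of this mapping, here is a minimal sketch of the offset computation, assuming the channel is the slowest-varying dimension (all of channel 0 first, then channel 1, and so on); the exact traversal order is the one drawn in Figure 1, and the helper name `tensor_index` is hypothetical, not an identifier from this repository.

```c
/* Hypothetical helper (not from tensor.c): map coordinate (c, h, w) of a
 * C x H x W tensor onto its offset in the flat one-dimensional array,
 * assuming channel-major order as drawn in Figure 1. */
static inline int tensor_index(int c, int h, int w, int H, int W)
{
    return c * H * W + h * W + w;
}
```

For the 3x4x5 example, `tensor_index(2, 3, 4, 4, 5)` returns 59, the last slot of the 60-element array.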
Notably, I use a single struct to represent both 3-D and 4-D matrices, as shown in the following code block. It is easy to see how this struct represents a 3-D matrix. However, in my design, a 4-D matrix is also stored in this struct, which means the fourth dimension has to be represented implicitly. As shown in Figure 3, the OUTSIDE format is what you can see, represented by my Tensor struct, whose shape is [Input Channel, Output Channel, Kernel Size^2], a 3-D shape. Meanwhile, the internal shape of this 4-D matrix is [Input Channel, Output Channel, Kernel Size, Kernel Size], which you have to imagine. Their storage sequences in memory are exactly the same.
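Since the struct definition itself is not reproduced in this text, the block below is a minimal sketch consistent with the description above; the field names `shape` and `data` are assumptions, not necessarily the identifiers used in tensor.c.

```c
/* Minimal sketch of the Tensor struct described above; the field names
 * are assumptions, not necessarily those used in tensor.c. */
typedef struct {
    int shape[3];   /* outside, 3-D shape; for a 4-D convolution kernel this
                     * holds [in_ch, out_ch, ksize * ksize], while the internal
                     * [in_ch, out_ch, ksize, ksize] view stays implicit */
    float *data;    /* flat one-dimensional storage, ordered as in Figure 1 */
} Tensor;
```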
## Output example

```text
Name: conv2d
Average cycle : 859606612.6
Average second: 3.737420e-01
GFlop : 0.0720
GFlop/s : 0.1926

Name: conv2d_omp
Average cycle : 123192591.4
Average second: 5.356200e-02
GFlop : 0.0720
GFlop/s : 1.3442

Name: conv2d_omp_im2col_locality
Average cycle : 67256949.2
Average second: 2.924215e-02
GFlop : 0.0720
GFlop/s : 2.4620

Name: conv2d_simd_fma_omp_im2col_locality
Average cycle : 55251988.8
Average second: 2.402260e-02
GFlop : 0.0720
GFlop/s : 2.9970
```
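For context on how such numbers are obtained, below is a sketch of a measurement harness in the spirit of reference [7]; it is not the actual trace.c code, and the `benchmark` function and its parameters are assumptions. Cycles come from the `__rdtsc()` intrinsic, wall time from `clock_gettime`, and GFlop/s is simply GFlop divided by the average seconds (e.g. 0.0720 / 2.402260e-02 ≈ 2.9970 for the last entry above).

```c
/* Hypothetical benchmark harness, not the actual trace.c implementation:
 * averages cycles and seconds over several runs and derives
 * GFlop/s = gflop / (average seconds). */
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() */

static void benchmark(const char *name, void (*fn)(void),
                      double gflop, int runs)
{
    double cycles = 0.0, seconds = 0.0;
    for (int i = 0; i < runs; i++) {
        unsigned long long c0 = __rdtsc();
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn();                               /* kernel under test */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        cycles  += (double)(__rdtsc() - c0);
        seconds += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }
    printf("Name: %s\n", name);
    printf("Average cycle : %.1f\n", cycles / runs);
    printf("Average second: %e\n", seconds / runs);
    printf("GFlop : %.4f\n", gflop);
    printf("GFlop/s : %.4f\n", gflop / (seconds / runs));
}
```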
## Reference
- [1] https://sahnimanas.github.io/post/anatomy-of-a-high-performance-convolution/
- [2] https://github.com/BVLC/caffe
- [3] https://github.com/pytorch/pytorch
- [4] https://software.intel.com/content/www/us/en/develop/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function.html
- [5] https://arxiv.org/pdf/1808.05567.pdf
- [6] https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction
- [7] https://www.intel.com/content/dam/support/us/en/documents/processors/APP-for-Intel-Xeon-Processors.pdf
## To-do

- MatMul with tiling
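As a sketch of the planned technique (the function name, tile size `T`, and row-major operands are assumptions, not code from this repository), loop tiling blocks the matmul loops so each T x T sub-block stays cache-resident while it is reused:

```c
/* Hypothetical sketch of tiled (blocked) matmul C += A * B, where A is
 * MxK, B is KxN, C is MxN, all row-major. The tile size T is a tunable
 * assumption, chosen so a few T x T blocks fit in cache. */
#define T 32

static void matmul_tiled(int M, int N, int K,
                         const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < M; i0 += T)
        for (int k0 = 0; k0 < K; k0 += T)
            for (int j0 = 0; j0 < N; j0 += T)
                /* compute the (i0, j0) block of C using the
                 * (i0, k0) block of A and the (k0, j0) block of B */
                for (int i = i0; i < M && i < i0 + T; i++)
                    for (int k = k0; k < K && k < k0 + T; k++) {
                        float a = A[i * K + k];
                        for (int j = j0; j < N && j < j0 + T; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```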