GEMM optimization practices

$ cd cpu
$ ./batch_run.sh
$ python ./plot_all.py

The optimized GEMM runs 10 to 20 times faster than the original version (compiled by gcc with the -O2 flag).
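For readers unfamiliar with what such a comparison involves, here is a minimal sketch of an unoptimized GEMM next to one common optimization, cache blocking with an i-k-j inner loop order. This is an illustration only, not the code from this repository: the names gemm_naive, gemm_blocked, max_abs_diff, and the tile size BLOCK are all hypothetical, and the sketch assumes square, row-major matrices of doubles. No particular speedup is claimed for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size; a real value would be tuned to the cache. */
#define BLOCK 64

/* Baseline: C += A * B for row-major n x n matrices, i-j-k loop order.
 * The inner loop walks B column-wise, i.e. with stride n. */
void gemm_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Cache-blocked variant: C += A * B computed tile by tile.
 * Inside a tile the i-k-j order makes the innermost loop read B and
 * write C with unit stride. */
void gemm_blocked(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}

/* Largest element-wise absolute difference between two arrays,
 * useful for checking an optimized kernel against the baseline. */
double max_abs_diff(size_t len, const double *x, const double *y)
{
    double worst = 0.0;
    for (size_t i = 0; i < len; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Both functions accumulate into C, so the caller is expected to initialize C first (to zero for a plain product). When timing kernels like these, it helps to check the optimized result against the baseline with max_abs_diff before trusting any speedup figure.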

monklof/hpc_demos