Referenced from: cuda-tutorial

TODO:
- Simple matrix multiply (TODO: check error constraints)
- Matrix multiply for large matrices
- Matrix multiply on tensor cores
- Matrix multiply optimized for L2 cache hits