- Unoptimized CPU-based matrix multiplication
- Multi-core optimized CPU-based matrix multiplication
- Uses a C++ library
- Uses Intel AVX instructions
- Uses AVX (256-bit floating point), AVX2 (256-bit, adds integer operations) and AVX-512 (512-bit)
- CUDA-based
- ARM64-based
- Apple
- Uses NEON for SIMD operations similar to AVX
- Rust-based (unoptimized, single core only)
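
To make the difference between the unoptimized and the multi-core CPU variants concrete, here is a minimal C++ sketch. It is not the code from this project: the names `Matrix`, `matmul_rows`, `matmul_naive` and `matmul_threaded` are hypothetical, matrices are assumed to be row-major flat vectors of `float`, and the multi-core version simply assumes one `std::thread` per contiguous block of output rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical layout: row-major flat storage. A is n x k, B is k x m, C is n x m.
using Matrix = std::vector<float>;

// Computes rows [row_begin, row_end) of C = A * B with a plain triple loop.
void matmul_rows(const Matrix& A, const Matrix& B, Matrix& C,
                 std::size_t k, std::size_t m,
                 std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * m + j];
            }
            C[i * m + j] = sum;
        }
    }
}

// Unoptimized single-core baseline.
Matrix matmul_naive(const Matrix& A, const Matrix& B,
                    std::size_t n, std::size_t k, std::size_t m) {
    assert(A.size() == n * k && B.size() == k * m);
    Matrix C(n * m, 0.0f);
    matmul_rows(A, B, C, k, m, 0, n);
    return C;
}

// Multi-core variant (assumed strategy): split the rows of C into contiguous
// chunks and give each chunk to its own thread. Threads write disjoint rows,
// so no locking is needed.
Matrix matmul_threaded(const Matrix& A, const Matrix& B,
                       std::size_t n, std::size_t k, std::size_t m,
                       unsigned num_threads = std::thread::hardware_concurrency()) {
    assert(A.size() == n * k && B.size() == k * m);
    Matrix C(n * m, 0.0f);
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may return 0
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceil(n / threads)
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&A, &B, &C, k, m, begin, end] {
            matmul_rows(A, B, C, k, m, begin, end);
        });
    }
    for (auto& w : workers) w.join();
    return C;
}
```

Calling `matmul_naive(A, B, n, k, m)` and `matmul_threaded(A, B, n, k, m)` on the same inputs should give identical results, since each output element is accumulated in the same order in both versions. On Linux, older toolchains may need `-pthread` when compiling.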