Pinned Repositories
avx_flops
Benchmark cpu flops using avx instructions
cpu_gemm_opt
how to design cpu gemm on x86 with avx256, that can beat openblas.
deepcore_source_code
Subpart source code of deepcore v0.7
FFT_implement
fft/ifft, r2c/c2r, 2d_r2c/2d_c2r, convolve, correlation, tiling fft, srfft, pfa, radix-2/3/5
gcnasm
amdgpu example code in hip/asm
gemm_implementations
miopen_cudnn_ops
ogl_cube
observe a cube with basic arcball camera in c++
composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
MISA
Machine Intelligence Shader Autogen. AMDGPU ML shader code generator. (previously iGEMMgen)
carlushuang's Repositories
carlushuang/cpu_gemm_opt
how to design cpu gemm on x86 with avx256, that can beat openblas.
carlushuang/gcnasm
amdgpu example code in hip/asm
carlushuang/avx_flops
Benchmark cpu flops using avx instructions
carlushuang/miopen_cudnn_ops
carlushuang/FFT_implement
fft/ifft, r2c/c2r, 2d_r2c/2d_c2r, convolve, correlation, tiling fft, srfft, pfa, radix-2/3/5
carlushuang/deepcore_source_code
Subpart source code of deepcore v0.7
carlushuang/gemm_implementations
carlushuang/mkldnn_test
carlushuang/amdgpu-jit
test project for amdgpu codegen
carlushuang/attn_bench
carlushuang/auto_gen
auto gen
carlushuang/binutils-gdb
Unofficial mirror of sourceware binutils-gdb repository. Updated daily.
carlushuang/CWBVH
An implementation of NVIDIA's paper "Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs"
carlushuang/D3D12nBodyGravity_clang
D3D12nBodyGravity example with clang build
carlushuang/HIP
HIP : Convert CUDA to Portable C++ Code
carlushuang/HIP-Examples
Examples for HIP
carlushuang/hipBLAS
ROCm BLAS marshalling library
carlushuang/hsaco-jit
carlushuang/kernel-launcher-amdgpu
carlushuang/LLVM_Note
carlushuang/Mandelbrot-Set
mandelbrot set
carlushuang/miopen-benchmark
benchmarking miopen
carlushuang/mlir
"Multi-Level Intermediate Representation" Compiler Infrastructure
carlushuang/Paddle
PArallel Distributed Deep LEarning
carlushuang/rocBLAS
Next generation BLAS implementation for ROCm platform
carlushuang/rocm-recipes
Recipes for rocm
carlushuang/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
carlushuang/tsm2x-imp
Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA
carlushuang/tvm_playground
carlushuang/xbyak
a JIT assembler for x86(IA-32)/x64(AMD64, x86-64) MMX/SSE/SSE2/SSE3/SSSE3/SSE4/FPU/AVX/AVX2/AVX-512 by C++ header