Pinned Repositories
aurora
awesome-ncnn
😎 A Collection of Awesome NCNN-based Projects
basecode
The Basecode compiler toolchain and language workbench.
builder
Continuous builder and binary build scripts for pytorch
caffe
Caffe: a fast open framework for deep learning.
Caffe_Code_Analysis
Caffe_Code_Analysis
chisel-template
A self-built Chisel project template
chisel-test
cmake-examples
Useful CMake Examples
dxvk
Vulkan-based implementation of D3D9, D3D10 and D3D11 for Linux / Wine
xiaoyu1004's Repositories
xiaoyu1004/aurora
xiaoyu1004/chisel-template
A self-built Chisel project template
xiaoyu1004/chisel-test
xiaoyu1004/conv3DBwdFilter
xiaoyu1004/ConvolutionBackward
xiaoyu1004/cublas_gemm_benchmark
xiaoyu1004/cuda-tensorcore-hgemm
xiaoyu1004/cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
xiaoyu1004/cudnnTest
xiaoyu1004/cutlass_test
xiaoyu1004/flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
xiaoyu1004/FPGA-DDR-SDRAM
An AXI4-based DDR1 controller for FPGAs that provides cheap, large-capacity memory for low-end FPGA embedded systems.
xiaoyu1004/FPGA-UART
3 modules: UART receiver, UART transmitter, and UART-to-AXI4 master (an interactive debugger).
xiaoyu1004/gemm-optimize
optimize gemm
xiaoyu1004/gpgpu-simx
a Cycle-Approximate Simulator
xiaoyu1004/how-to-optimize-gemm-cuda
xiaoyu1004/how-to-optimize-gemm-in-cpu
A gemm compute library
xiaoyu1004/how_to_optimize_convolution_in_CPU
how_to_optimize_convolution_in_CPU
xiaoyu1004/How_to_optimize_in_GPU
A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is at or near the theoretical limit.
xiaoyu1004/ics-pa
The wrapper repo for NJU ICS PA.
xiaoyu1004/juliuscblas
a simple blas library
xiaoyu1004/llama2.c
Inference Llama 2 in one file of pure C
xiaoyu1004/mtensor
A C++ CUDA Tensor Lazy Computing Library
xiaoyu1004/NyuziProcessor
GPGPU microprocessor architecture
xiaoyu1004/optimize-in-gpu
xiaoyu1004/RV32ISC
A RISC-V RV32I ISA Single Cycle CPU
xiaoyu1004/rvcc
A C programming language compiler
xiaoyu1004/rvemu
xiaoyu1004/rvemu-singlecycle
A single-cycle RISC-V simulator
xiaoyu1004/VeriGPU
OpenSource GPU, in Verilog, loosely based on RISC-V ISA