xiaoyu1004

Pinned Repositories

aurora
Language: C++ 0 1 00
awesome-ncnn
😎 A Collection of Awesome NCNN-based Projects
0 0 00
basecode
The Basecode compiler toolchain and language workbench.
0 0 00
builder
Continuous builder and binary build scripts for pytorch
Language: Shell 0 0 00
caffe
Caffe: a fast open framework for deep learning.
Language: C++ 0 0 00
Caffe_Code_Analysis
Caffe_Code_Analysis
Language: C++ 0 0 00
chisel-template
Self-built Chisel project template
Language: Scala 0 0 00
chisel-test
Language: Scala 0 1 00
cmake-examples
Useful CMake Examples
Language: CMake 0 0 00
dxvk
Vulkan-based implementation of D3D9, D3D10 and D3D11 for Linux / Wine
Language: C++ 1 0 00

xiaoyu1004's Repositories

xiaoyu1004/aurora
Language: C++ 0 1 00
xiaoyu1004/chisel-template
Self-built Chisel project template
Language: Scala 0 0 00
xiaoyu1004/chisel-test
Language: Scala 0 1 00
xiaoyu1004/conv3DBwdFilter
Language: CMake 1 0
xiaoyu1004/ConvolutionBackward
Language: C++ 1 0
xiaoyu1004/cublas_gemm_benchmark
Language: Cuda 1 0
xiaoyu1004/cuda-tensorcore-hgemm
Language: Cuda 0 0
xiaoyu1004/cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
Language: Cuda 0 0
xiaoyu1004/cudnnTest
Language: C++ 1 0
xiaoyu1004/cutlass_test
Language: Cuda
xiaoyu1004/flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
xiaoyu1004/FPGA-DDR-SDRAM
An AXI4-based DDR1 controller that provides cheap, large-capacity memory for low-end FPGA embedded systems.
Language: Verilog 0 0
xiaoyu1004/FPGA-UART
Three modules: a UART receiver, a UART transmitter, and a UART-to-AXI4 master (interactive debugger).
Language: Verilog 0 0
xiaoyu1004/gemm-optimize
optimize gemm
Language: C 1 0
xiaoyu1004/gpgpu-simx
a Cycle-Approximate Simulator
Language: C++ 1 0
xiaoyu1004/how-to-optimize-gemm-cuda
Language: Cuda 1 01
xiaoyu1004/how-to-optimize-gemm-in-cpu
A gemm compute library
Language: C++ 1
xiaoyu1004/how_to_optimize_convolution_in_CPU
how_to_optimize_convolution_in_CPU
Language: C++ 1 0
xiaoyu1004/How_to_optimize_in_GPU
A series of GPU optimization topics that explain in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is at or near the theoretical limit.
Language: Cuda 0 0
xiaoyu1004/ics-pa
The wrapper repo for NJU ICS PA.
Language: Shell 0 0
xiaoyu1004/juliuscblas
A simple BLAS library
Language: C++ 1 0
xiaoyu1004/llama2.c
Inference Llama 2 in one file of pure C
xiaoyu1004/mtensor
A C++ CUDA Tensor Lazy Computing Library
Language: C++ 0 0
xiaoyu1004/NyuziProcessor
GPGPU microprocessor architecture
Language: C 0 0
xiaoyu1004/optimize-in-gpu
Language: Cuda 1 0
xiaoyu1004/RV32ISC
A RISC-V RV32I ISA Single Cycle CPU
Language: Scala 0 0
xiaoyu1004/rvcc
A C programming language compiler
Language: C 1 0
xiaoyu1004/rvemu
Language: C 1 0
xiaoyu1004/rvemu-singlecycle
A single-cycle RISC-V simulator
Language: C++ 1 0
xiaoyu1004/VeriGPU
OpenSource GPU, in Verilog, loosely based on RISC-V ISA
Language: SystemVerilog 0 0