zwshan

zwshan's Stars

Lin-Mao/DrGPUM
A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.
Language:Python222
LitLeo/TensorRT_Tutorial
Language:C++992185
BlinkDL/RWKV-CUDA
The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )
Language:Cuda21235
tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
Language:Cuda5440
jundaf2/cutlass-b2bgemm
An extension to the CUTLASS half-precision b2b GEMM example
Language:C++12
masahi/cutlass_fpA_intB_gemm
Language:C++62
cloudcores/CuAssembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
Language:Python40572
daadaada/turingas
Assembler for NVIDIA Volta and Turing GPUs
Language:Python20240
proplot-dev/proplot
🎨 A succinct matplotlib wrapper for making beautiful, publication-quality graphics
Language:Python 1.1k 102
baobuiquang/ParallelLSTM
Parallel LSTM training for sequence prediction from sequential data.
Language:Jupyter Notebook3
salesforce/pytorch-qrnn
PyTorch implementation of the Quasi-Recurrent Neural Network - up to 16 times faster than NVIDIA's cuDNN LSTM
Language:Python 1.3k 193
Shubodh/Optimizing-LSTMs-on-GPU
Implementation of the paper "Optimizing Performance of Recurrent Neural Networks on GPUs" in CUDA and OpenMP.
Language:C++52
daliansky/minisforum-UM560XT-Hackintosh
6916
Xmingbai/Minisforum-UM560-UM580-APU-Hackintosh
483
zwshan/libtorch_with_cuda_kernel
libtorch with custom cuda kernel
Language:CMake1
daliansky/Beelink-SER5-Hackintosh
18427
timvieira/crf
Simple implementation of Conditional Random Fields (CRF) in Python. A faster, more powerful, Cython implementation is available in the vocrf project https://github.com/timvieira/vocrf
Language:Python342116
taku910/crfpp
CRF++: Yet Another CRF toolkit
Language:Shell506193
lorenlugosch/pytorch_HMM
HMMs in PyTorch
Language:Jupyter Notebook13531
benfred/py-spy
Sampling profiler for Python programs
Language:Rust 12.9k 431
MrNeRF/cuda_libtorch_googletest_template
This project is a CUDA and libtorch-based template, intended to serve as a starting point for high performance computing and machine learning projects on NVIDIA GPUs. It uses Google Test as a testing framework to test against libtorch.
Language:Cuda5
mtreviso/linear-chain-crf
Implementation of a linear-chain CRF in PyTorch
Language:Python9522
clearhanhui/LearnLibTorch
LibTorch tutorial in Chinese.
Language:Python918
chunxxc/lokatt
An open source HMM-DNN nanopore DNA basecaller
Language:Python3
teelinsan/parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
Language:Python1108
aieater/rocm_pytorch_informations
The official page of ROCm/PyTorch will contain information that is always confusing. On this page we will endeavor to describe accurate information based on the knowledge gained by GPUEater infrastructure development.
889
Liu-xiandong/How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.
Language:Cuda827132
nanoporetech/fast-ctc-decode
Blitzing Fast CTC Beam Search Decoder
Language:Rust17827
alexballas/go2tv
Cast media files to UPnP/DLNA Media Renderers and Smart TVs.
Language:Go52252
buddy-compiler/buddy-benchmark
Benchmark Framework for Buddy Projects
Language:C4839