zwshan's Stars
Lin-Mao/DrGPUM
A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.
LitLeo/TensorRT_Tutorial
BlinkDL/RWKV-CUDA
The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM)
tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
jundaf2/cutlass-b2bgemm
An extension to the CUTLASS half-precision b2b GEMM example
masahi/cutlass_fpA_intB_gemm
cloudcores/CuAssembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
daadaada/turingas
Assembler for NVIDIA Volta and Turing GPUs
proplot-dev/proplot
🎨 A succinct matplotlib wrapper for making beautiful, publication-quality graphics
baobuiquang/ParallelLSTM
Parallel LSTM training for sequence prediction from sequential data.
salesforce/pytorch-qrnn
PyTorch implementation of the Quasi-Recurrent Neural Network - up to 16 times faster than NVIDIA's cuDNN LSTM
Shubodh/Optimizing-LSTMs-on-GPU
Implementation of the paper "Optimizing Performance of Recurrent Neural Networks on GPUs" in CUDA and OpenMP.
daliansky/minisforum-UM560XT-Hackintosh
Xmingbai/Minisforum-UM560-UM580-APU-Hackintosh
zwshan/libtorch_with_cuda_kernel
libtorch with a custom CUDA kernel
daliansky/Beelink-SER5-Hackintosh
timvieira/crf
Simple implementation of Conditional Random Fields (CRF) in Python. A faster, more powerful Cython implementation is available in the vocrf project: https://github.com/timvieira/vocrf
taku910/crfpp
CRF++: Yet Another CRF toolkit
lorenlugosch/pytorch_HMM
HMMs in PyTorch
benfred/py-spy
Sampling profiler for Python programs
MrNeRF/cuda_libtorch_googletest_template
This project is a CUDA and libtorch-based template, intended to serve as a starting point for high performance computing and machine learning projects on NVIDIA GPUs. It uses Google Test as a testing framework to test against libtorch.
mtreviso/linear-chain-crf
Implementation of a linear-chain CRF in PyTorch
clearhanhui/LearnLibTorch
LibTorch tutorial in Chinese.
chunxxc/lokatt
An open source HMM-DNN nanopore DNA basecaller
teelinsan/parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
aieater/rocm_pytorch_informations
The official ROCm/PyTorch pages often contain confusing information. This page aims to provide accurate information based on knowledge gained from GPUEater infrastructure development.
Liu-xiandong/How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.
nanoporetech/fast-ctc-decode
Blitzing Fast CTC Beam Search Decoder
alexballas/go2tv
Cast media files to UPnP/DLNA Media Renderers and Smart TVs.
buddy-compiler/buddy-benchmark
Benchmark Framework for Buddy Projects