Pinned Repositories
abseil-cpp
Abseil Common Libraries (C++)
Anakin
AvxToNeon
In this project, frequently used AVX instructions are encapsulated as independent modules so that code being ported from x86 to ARM NEON does not have to re-implement them each time, reducing repeated development work.
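The entry above describes the encapsulation idea only in general terms, so here is a minimal, self-contained C++ sketch of what such a wrapper typically looks like: a 256-bit AVX-style vector add exposed behind one helper that compiles to _mm256_* intrinsics on x86 and to a pair of NEON operations on AArch64. All type and function names (vec256f, vec256f_add, ...) are illustrative assumptions, not the AvxToNeon repository's actual API.

// Minimal sketch of wrapping an AVX-width operation so the same call works on
// x86 (AVX) and AArch64 (NEON). Names are illustrative, not AvxToNeon's API.
#include <cstdio>

#if defined(__aarch64__)
  #include <arm_neon.h>
  // NEON registers are 128-bit, so a 256-bit "AVX-like" vector is a pair.
  struct vec256f { float32x4_t lo, hi; };
  static inline vec256f vec256f_load(const float* p) {
    return { vld1q_f32(p), vld1q_f32(p + 4) };
  }
  static inline vec256f vec256f_add(vec256f a, vec256f b) {
    return { vaddq_f32(a.lo, b.lo), vaddq_f32(a.hi, b.hi) };
  }
  static inline void vec256f_store(float* p, vec256f v) {
    vst1q_f32(p, v.lo); vst1q_f32(p + 4, v.hi);
  }
#else
  #include <immintrin.h>   // compile with -mavx on x86
  typedef __m256 vec256f;
  static inline vec256f vec256f_load(const float* p)     { return _mm256_loadu_ps(p); }
  static inline vec256f vec256f_add(vec256f a, vec256f b){ return _mm256_add_ps(a, b); }
  static inline void vec256f_store(float* p, vec256f v)  { _mm256_storeu_ps(p, v); }
#endif

int main() {
  float a[8], b[8], c[8];
  for (int i = 0; i < 8; ++i) { a[i] = float(i); b[i] = 10.0f * i; }
  vec256f_store(c, vec256f_add(vec256f_load(a), vec256f_load(b)));  // c[i] = 11*i
  for (int i = 0; i < 8; ++i) printf("%g ", c[i]);
  printf("\n");
  return 0;
}

The point of the encapsulation is that application code calls vec256f_add once and the per-architecture intrinsic choice lives in a single header, which is the "repeated development work" the description refers to.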
efficient_transformer
Scaling Transformer architectures has been critical for pushing the frontiers of Language Modelling (LM), a problem central to Natural Language Processing (NLP) and language understanding. Although there is a direct positive relationship between Transformer capacity and LM performance, practical limitations make training massive models infeasible: their computation and memory costs cannot be addressed solely by training on parallel devices. In this thesis, we investigate two approaches that make Transformers more computationally and memory efficient. First, we introduce the Mixture-of-Experts (MoE) Transformer, which can scale its capacity at a sub-linear computational cost. Second, we present a novel content-based sparse attention mechanism called Hierarchical Self Attention (HSA). We demonstrate that the MoE Transformer achieves lower test perplexity than a vanilla Transformer with higher computational demands. Language Modelling experiments with a Transformer that uses HSA in place of conventional attention show that HSA can speed up attention computation by up to 330% at a negligible cost in model performance.
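The abstract says the MoE Transformer scales capacity at a sub-linear computational cost; the mechanism behind that claim is sparse routing, where a gating network scores every expert for each token but only the top-k experts are actually run. Below is a minimal, self-contained C++ sketch of that top-k routing step under toy assumptions (a fixed gate-logit vector and scalar "experts" standing in for feed-forward networks); it illustrates the general technique and is not code from the efficient_transformer or mixture-of-experts repositories.

// Minimal top-k (k = 2) mixture-of-experts routing sketch.
// Only the k selected experts run per token, so per-token compute grows with k,
// not with the total number of experts -- the "sub-linear" scaling idea.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// A toy "expert": scales its input by a fixed weight (stands in for an FFN).
struct Expert {
  float weight;
  std::vector<float> forward(const std::vector<float>& x) const {
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = weight * x[i];
    return y;
  }
};

int main() {
  const int num_experts = 8, top_k = 2;
  std::vector<Expert> experts;
  for (int e = 0; e < num_experts; ++e) experts.push_back({0.1f * (e + 1)});

  std::vector<float> token = {1.0f, 2.0f, 3.0f};  // one token's hidden state
  // In a real MoE these logits come from a learned linear gate applied to the
  // token; here they are fixed so the sketch stays self-contained.
  std::vector<float> logits = {0.3f, 2.1f, -0.5f, 1.7f, 0.0f, -1.2f, 0.9f, 0.4f};

  // Softmax over gate logits.
  float mx = *std::max_element(logits.begin(), logits.end()), sum = 0.0f;
  std::vector<float> gate(num_experts);
  for (int e = 0; e < num_experts; ++e) { gate[e] = std::exp(logits[e] - mx); sum += gate[e]; }
  for (int e = 0; e < num_experts; ++e) gate[e] /= sum;

  // Pick the top-k experts by gate probability.
  std::vector<int> idx(num_experts);
  for (int e = 0; e < num_experts; ++e) idx[e] = e;
  std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                    [&](int a, int b) { return gate[a] > gate[b]; });

  // Run only the selected experts and combine their outputs, weighted by the
  // renormalised gate probabilities.
  float norm = 0.0f;
  for (int j = 0; j < top_k; ++j) norm += gate[idx[j]];
  std::vector<float> out(token.size(), 0.0f);
  for (int j = 0; j < top_k; ++j) {
    std::vector<float> y = experts[idx[j]].forward(token);
    for (size_t i = 0; i < token.size(); ++i) out[i] += (gate[idx[j]] / norm) * y[i];
  }

  printf("selected experts: %d %d\n", idx[0], idx[1]);
  printf("output: %.3f %.3f %.3f\n", out[0], out[1], out[2]);
  return 0;
}

Because only top_k of the num_experts experts execute per token, adding more experts grows parameter count (capacity) without a proportional increase in per-token compute.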
jittor
legion
The Legion Parallel Programming System
mixture-of-experts
PyTorch Re-Implementation of "The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer et al. https://arxiv.org/abs/1701.06538
nccl
Optimized primitives for collective multi-GPU communication
nmaker
A heterogeneous cosmological N-Body simulation code for multicore (CPU) and manycore (MIC) platforms.
OpenArray
lowintelligence's Repositories
lowintelligence/abseil-cpp
Abseil Common Libraries (C++)
lowintelligence/Anakin
lowintelligence/AvxToNeon
In this project, frequently used AVX instructions are encapsulated as independent modules so that code being ported from x86 to ARM NEON does not have to re-implement them each time, reducing repeated development work.
lowintelligence/efficient_transformer
Scaling Transformer architectures has been critical for pushing the frontiers of Language Modelling (LM), a problem central to Natural Language Processing (NLP) and language understanding. Although there is a direct positive relationship between Transformer capacity and LM performance, practical limitations make training massive models infeasible: their computation and memory costs cannot be addressed solely by training on parallel devices. In this thesis, we investigate two approaches that make Transformers more computationally and memory efficient. First, we introduce the Mixture-of-Experts (MoE) Transformer, which can scale its capacity at a sub-linear computational cost. Second, we present a novel content-based sparse attention mechanism called Hierarchical Self Attention (HSA). We demonstrate that the MoE Transformer achieves lower test perplexity than a vanilla Transformer with higher computational demands. Language Modelling experiments with a Transformer that uses HSA in place of conventional attention show that HSA can speed up attention computation by up to 330% at a negligible cost in model performance.
lowintelligence/jittor
lowintelligence/legion
The Legion Parallel Programming System
lowintelligence/mixture-of-experts
PyTorch Re-Implementation of "The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer et al. https://arxiv.org/abs/1701.06538
lowintelligence/nccl
Optimized primitives for collective multi-GPU communication
lowintelligence/nmaker
A heterogeneous cosmological N-Body simulation code for multicore (CPU) and manycore (MIC) platforms.
lowintelligence/OpenArray
lowintelligence/Paddle
PArallel Distributed Deep LEarning
lowintelligence/tensorflow
Computation using data flow graphs for scalable machine learning
lowintelligence/TensorRT
TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs and deep learning accelerators.
lowintelligence/TePDist
lowintelligence/transformer
A TensorFlow Implementation of the Transformer: Attention Is All You Need
lowintelligence/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs