FFF shows promise to exponentially reduce the compute required by a feed-forward neural-network layer (engaging only O(log n) of its n neurons per input), while retaining most of its representational power.
The purpose of this repo is to play with FastFeedForward Networks.
We plan to engineer a performant implementation.
Also, as the innovation is new, we want to explore & tweak, rather than simply dive 100% into optimizations.
Chat with us on Discord
NOTE: We're not affiliated with the authors of the FFF papers, but we'd be thrilled if they were to pop in and say hi!
- We have provided a re-conceptualization of this innovation in doc/theory.md. At the heart is the idea of dynamically choosing (per input sample) a most-appropriate basis pair (a basis in INPUT space and a basis in OUTPUT space), approximating our input x as a linear combination of X-basis vectors, and projecting into OUTPUT space by applying these coefficients to the corresponding Y-basis vectors. The basis pair is found by traversing a binary tree, where each node contains an (X, Y) pair of basis vectors (sketched in symbols below).
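In symbols (our own notation, not taken from the papers): if the traversal for an input x visits nodes i = 1…D with basis pairs (u_i, v_i), the layer computes

```math
\lambda_i = \langle x,\, u_i \rangle, \qquad y = \sum_{i=1}^{D} \lambda_i \, v_i ,
```

where the sign of each coefficient λ_i decides whether the traversal steps to the left or the right child.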
- We've rewritten the core code fff.py and it boils down to half a dozen lines of PyTorch/einsum. There's a `for` loop (for traversing the binary tree), so the naive solution is extremely non-optimized. We've tweaked the weight-initialization and shot for a simpler target than the original papers do. A minimal illustrative version is sketched below.
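To make this concrete, here is a minimal sketch of such a forward pass. This is our own illustrative code, not a copy of fff.py; the tensor names U/V and the heap-ordered node layout are assumptions:

```python
import torch

def fff_forward(x, U, V, depth):
    """Illustrative FFF forward pass (hypothetical, not the repo's fff.py).

    x: (batch, d_in)     input batch
    U: (n_nodes, d_in)   X-basis vector per tree node
    V: (n_nodes, d_out)  Y-basis vector per tree node
    Nodes are stored in heap order (root = 0, children of i at 2i+1, 2i+2),
    so a tree of `depth` levels needs n_nodes = 2**depth - 1 rows.
    """
    batch = x.shape[0]
    node = torch.zeros(batch, dtype=torch.long, device=x.device)  # all samples start at the root
    y = torch.zeros(batch, V.shape[1], device=x.device, dtype=x.dtype)
    for _ in range(depth):                            # the `for` loop mentioned above
        lam = torch.einsum('bd,bd->b', x, U[node])    # coefficient of x along this node's X-basis vector
        y = y + lam.unsqueeze(1) * V[node]            # accumulate the scaled Y-basis vector
        node = 2 * node + 1 + (lam > 0).long()        # branch left/right on the sign of the coefficient
    return y

# e.g. a depth-4 tree (15 nodes) over a 1024 -> 256 layer:
y = fff_forward(torch.randn(32, 1024), torch.randn(15, 1024), torch.randn(15, 256), depth=4)
```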
- We benchmark FFF against a standard PyTorch FF (FeedForward) layer. The first benchmark shows that for small layers FF wins, but as we increase the layer size FFF starts to outperform FF: e.g. at nIn = nOut = 2^14, FFF already runs at 20x the speed of FF (see the note below on why the gap widens).
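The asymptotics explain the crossover: a dense FF forward pass costs on the order of nIn · nOut per sample, whereas FFF touches only `depth` basis pairs, each costing nIn + nOut, so its cost grows roughly as depth · (nIn + nOut). A rough sketch of timing the FF side (illustrative only; our actual benchmark script differs):

```python
import time
import torch

n = 2**14                         # nIn = nOut = 16384, the size quoted above
x = torch.randn(64, n)
ff = torch.nn.Linear(n, n)

with torch.no_grad():
    ff(x)                         # warm-up
    t0 = time.perf_counter()
    for _ in range(10):
        ff(x)
print(f'FF forward: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms per batch of 64')
```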
- Next we check that an FFF layer is actually learning. We create a simple CIFAR10 classifier NN (here) and replace the FF layers with FFF (a hypothetical drop-in wrapper is sketched below). We find that after 5 epochs FF has achieved ~52% accuracy whereas FFF has achieved ~48%. So FFF trains.
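Such a swap might look like this. This is a hypothetical wrapper around the `fff_forward` sketch above, not the repo's actual layer; the scaled-Gaussian init is an assumption, and the hard sign-branching differs from the original papers, which train with a soft sigmoid mixture and harden it only at inference:

```python
import torch
from torch import nn

class FFFLayer(nn.Module):
    """Hypothetical nn.Linear replacement built on the fff_forward sketch above."""
    def __init__(self, d_in, d_out, depth=8):
        super().__init__()
        self.depth = depth
        n_nodes = 2**depth - 1
        # scaled-Gaussian init is our assumption, not the repo's tweaked scheme
        self.U = nn.Parameter(torch.randn(n_nodes, d_in) / d_in**0.5)
        self.V = nn.Parameter(torch.randn(n_nodes, d_out) / d_out**0.5)

    def forward(self, x):
        return fff_forward(x, self.U, self.V, self.depth)

# e.g. in the classifier: replace nn.Linear(512, 10) with FFFLayer(512, 10)
```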
- Tweaking, tinkering, benchmarking, analyzing learning, exploring
- Creating CPU and GPU/CUDA optimized implementations
- Exploring how we can use this innovation in other architectures (CNN, Attention, etc.) and whether it leads to novel architectures.
- (18 Sep 2023) Fast Feedforward Networks (code)
- (15 Nov 2023) Exponentially Faster Language Modelling (code)
  The second revision of the paper has an updated repo here containing CUDA code.
2023.11.23
- π created pbelcak/UltraFastBERT#1
  Observing that the BERT benchmark performs slower than the vanilla BERT on HF
- π created pbelcak/UltraFastBERT#2
  An interpretation of the core algorithm, and a suggestion for improvement (remove the .gelu).
  Links to a gist demo of FFF operating over MNIST.