a stream of interesting papers read or to be read
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits: weights are ternary {-1, 0, 1} (BitNet b1.58). Resource-use and performance benefits; could open up new avenues for architecture design.
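The core trick, sketched in a few lines of PyTorch (my paraphrase of the paper's absmean-style quantization, not the reference implementation):

    import torch

    def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
        """Round a weight tensor to {-1, 0, +1}, scaled by its mean absolute value."""
        gamma = w.abs().mean()                          # per-tensor scale
        w_q = (w / (gamma + eps)).round().clamp(-1, 1)
        return w_q, gamma                               # use w_q * gamma at matmul time

    w = torch.randn(4, 4)
    w_q, gamma = ternary_quantize(w)
    print(w_q)                                          # entries are only -1, 0, or 1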
-
Orca 2: Teaching Small Language Models How to Reason: still fun to plug its questions into newer models, e.g. Mistral Large, which just cruises through them.
-
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations: THE original Triton paper. So modest. But all your base are belong to us. One script to rule them all, and in the tensor to bind them. https://github.com/openai/triton https://triton-lang.org/main/index.html
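For flavour, the canonical element-wise add kernel from the Triton tutorials (needs a CUDA device; the block size here is an arbitrary choice):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                           # one program per tile
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                           # guard the ragged last tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out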
-
The Cache Performance and Optimizations of Blocked Algorithms: comprehensive study of blocked (tiled) algorithms and how they exploit the memory hierarchy. Referenced in the Triton literature.
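The blocking idea itself, reduced to a toy NumPy matmul (the cache payoff only shows up in a compiled language; this just shows the loop structure):

    import numpy as np

    def blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 64) -> np.ndarray:
        """Tile the i/j/k loops so each block-sized sub-problem stays cache-resident."""
        n, k = a.shape
        _, m = b.shape
        c = np.zeros((n, m), dtype=a.dtype)
        for i in range(0, n, block):
            for j in range(0, m, block):
                for p in range(0, k, block):
                    c[i:i+block, j:j+block] += a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
        return c

    a, b = np.random.rand(256, 256), np.random.rand(256, 256)
    assert np.allclose(blocked_matmul(a, b), a @ b)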
-
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning: attention scales quadratically in compute and memory with context length. Tri Dao's work tiles the computation, and this follow-up reschedules it with better work partitioning to push device utilization up to ~70% of peak. Has been implemented in Triton.
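The recurrence at the heart of it, sketched with PyTorch on the CPU (just the online-softmax tiling over key/value blocks, not the fused GPU kernel or the FlashAttention-2 scheduling):

    import torch

    def tiled_attention(q, k, v, block: int = 128):
        """Attention over key/value tiles with an online softmax, so the full
        (L x L) score matrix is never materialised."""
        scale = q.shape[-1] ** -0.5
        m = torch.full((q.shape[0], 1), float("-inf"))   # running row max
        l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
        acc = torch.zeros_like(q)                        # running numerator @ V
        for start in range(0, k.shape[0], block):
            kj, vj = k[start:start + block], v[start:start + block]
            s = (q @ kj.T) * scale
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            corr = torch.exp(m - m_new)                  # rescale old accumulators
            p = torch.exp(s - m_new)
            l = l * corr + p.sum(dim=-1, keepdim=True)
            acc = acc * corr + p @ vj
            m = m_new
        return acc / l

    q, k, v = (torch.randn(256, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)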
-
Ternary optical computer principle: three states of light (horizontal polarization, vertical polarization, and off) are used in this proposal for a ternary computer, with all the advantages that ternary logic offers over binary (https://en.wikipedia.org/wiki/Balanced_ternary). Read in conjunction with the 1.58-bit LLM entry above.
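Balanced ternary in a few lines, for intuition (digits listed least-significant first):

    def to_balanced_ternary(n: int) -> list[int]:
        """Represent an integer with digits drawn from {-1, 0, +1}."""
        if n == 0:
            return [0]
        digits = []
        while n != 0:
            r = n % 3
            if r == 2:            # 2 = 3 - 1: write -1 and carry one upward
                digits.append(-1)
                n = n // 3 + 1
            else:
                digits.append(r)
                n //= 3
        return digits

    assert to_balanced_ternary(5) == [-1, -1, 1]    # -1*1 + -1*3 + 1*9 = 5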
-
Hegel 2.0: The imaginary history of ternary computing: amusing article in Cabinet magazine.
-
The Sky Above The Clouds: beyond cloud computing with intercloud brokers; a vision of how cloud computing could evolve.
-
How to use Transformer Networks to build a Forecasting model: basic intro with code in PyTorch: https://github.com/CVxTz/time_series_forecasting
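A minimal version of the idea in plain PyTorch (not the repo's model, just an encoder-only forecaster predicting one step ahead; all sizes are arbitrary):

    import torch
    import torch.nn as nn

    class TinyForecaster(nn.Module):
        """Project each time step to d_model, encode with self-attention,
        predict the next value from the last position."""
        def __init__(self, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
            super().__init__()
            self.input_proj = nn.Linear(1, d_model)
            self.pos = nn.Parameter(torch.zeros(1, 512, d_model))   # learned positions
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, 1)

        def forward(self, x):                         # x: (batch, seq_len, 1)
            h = self.input_proj(x) + self.pos[:, : x.shape[1]]
            return self.head(self.encoder(h)[:, -1])

    model = TinyForecaster()
    x = torch.randn(8, 96, 1)                         # 8 series, 96 past steps each
    print(model(x).shape)                             # torch.Size([8, 1])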
-
Transformers in Time-Series Analysis: A Tutorial: surveys several enhancements to the original Transformer architecture, along with best practices and techniques for effectively training Transformers on time-series data.
-
Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case
-
Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting: combines recurrent layers for local processing with interpretable self-attention layers for long-term dependencies. TFT uses specialized variable-selection components to pick out relevant features and a series of gating layers to suppress unnecessary ones (code in the google-research repo).
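The gating building block (TFT's gated residual network), in a simplified sketch with the context input and dropout omitted:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedResidualNetwork(nn.Module):
        """Small MLP whose output passes through a GLU gate before the residual
        add-and-norm, so the block can learn to suppress itself entirely."""
        def __init__(self, d: int):
            super().__init__()
            self.fc1 = nn.Linear(d, d)
            self.fc2 = nn.Linear(d, d)
            self.gate = nn.Linear(d, 2 * d)           # GLU halves this back to d
            self.norm = nn.LayerNorm(d)

        def forward(self, a):
            h = self.fc2(F.elu(self.fc1(a)))
            return self.norm(a + F.glu(self.gate(h), dim=-1))

    x = torch.randn(8, 32)
    print(GatedResidualNetwork(32)(x).shape)          # torch.Size([8, 32])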
-
Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting: LogSparse Transformer with only O(L (log L)^2) memory cost.
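Roughly the attention pattern it proposes, as a mask builder (each query keeps itself plus exponentially spaced past positions; details such as local windows and restart points are skipped):

    import torch

    def logsparse_mask(seq_len: int) -> torch.Tensor:
        """Boolean (L x L) mask: position i may attend to i and to i - 2^k,
        giving O(log L) keys per query instead of O(L)."""
        mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
        for i in range(seq_len):
            mask[i, i] = True
            step = 1
            while i - step >= 0:
                mask[i, i - step] = True
                step *= 2
        return mask

    print(logsparse_mask(16).sum(dim=-1))   # each row keeps ~log2(i) + 1 positions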