This repository contains a list of the books, blogs, research papers and white papers that I have read and found interesting.
- AI, DL, NLP and RL
- Calculus
- Computer Architecture
- Computer Graphics
- Data Structures and Algorithms
- Digital Electronics
- Graph Theory
- Information Theory
- Linear Algebra
- Measure Theory
- Optimization Theory
- Probability and Stochastic Processes
- Quantum Computing
- Signal Processing
- 1-bit Adam: communication efficient large-scale training with Adam’s convergence speed
- 5 best practices for efficient model training
- 8-bit approximations for parallelism in deep learning
- 8-bit optimizers via block-wise quantization
- A 'neural' network that learns to play Backgammon
- A BetterTransformer for fast transformer inference
- A deep reinforced model for abstractive summarization
- A dynamical approach to temporal pattern processing
- A few more examples may be worth billions of parameters
- A general and adaptive robust loss function
- A generalist agent
- A gentle introduction to 8-bit matrix multiplication for transformers at scale using Hugging Face transformers, accelerate and bitsandbytes
- A note on the evaluation of generative models
- A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
- A simple but tough-to-beat baseline for sentence embeddings
- A simple language model for task-oriented dialogue
- A simple neural attentive meta-learner
- A simple neural network module for relational reasoning
- A study of BFLOAT16 for deep learning training
- A style-based generator architecture for generative adversarial networks
- A stylometric inquiry into hyperpartisan and fake news
- A3T: adversarially augmented adversarial training
- Accelerated PyTorch 2 transformers
- Accelerating large language model training with variable sparse pre-training and dense fine-tuning
- Accelerating PyTorch with CUDA graphs
- AdapterHub: a framework for adapting transformers
- Adversarial approximate inference for speech to electroglottograph conversion
- Adversarial autoencoders
- Adversarial examples that fool both computer vision and time-limited humans
- Adversarial feature learning
- Adversarial generation of natural language
- Adversarial information factorization
- Adversarially learned inference
- AlexaTM 20B: few-shot learning using a large-scale multilingual seq2seq model
- Amazon SageMaker model parallelism: a general and flexible framework for large model training
- An image is worth 16x16 words: transformers for image recognition at scale
- An overview of gradient descent optimization algorithms
- Analysing mathematical reasoning abilities of neural models
- Approximation by superpositions of a sigmoidal function
- Artificial Intelligence: a modern approach
- Aspect based sentiment analysis with gated convolutional networks
- Attention is all you need
- Attention is off by one
- Auto-encoding variational Bayes
- Backpropagation through the void: optimizing control variates for black-box gradient estimation
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation and comprehension
- Batch normalization: accelerating deep network training by reducing internal covariate shift
- Behavioral cloning from observation
- BERT: pre-training of deep bidirectional transformers for language understanding
- Beyond domain APIs: Task-oriented conversational modeling with unstructured knowledge access
- BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation
- Blockwise parallel transformer for large context models
- BLOOM: A 176B-parameter open-access multilingual language model
- Bootstrapping entity alignment with knowledge graph embedding
- Bridging the gap between prior and posterior knowledge selection for knowledge-grounded dialogue generation
- Bringing open large language models to consumer devices
- BTLM-3B-8K: 7B performance in a 3 billion parameter model
- Building blocks for a complex-valued transformer architecture
- CATS: contextually-aware thresholding for sparsity in large language models
- ChatGPT: optimizing language models for dialogue
- ColBERT: efficient and effective passage search via contextualized late interaction over BERT
- Colossal-AI: a unified deep learning system for large-scale parallel training
- Compiling machine learning programs via high-level tracing
- Complex transformer: a framework for modeling complex-valued sequence
- Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning
- Conditional image synthesis with auxiliary classifier GANs
- Conformal nucleus sampling
- Connecting large language models with evolutionary algorithms yields powerful prompt optimizers
- Connectivity versus entropy
- Constituency parsing with a self-attentive encoder
- Constraint based knowledge base distillation in end-to-end task oriented dialogs
- Context generation improves open domain question answering
- Convert transformers to ONNX with hugging face optimum
- Convolutional networks on graphs for learning molecular fingerprints
- Convolutional neural network language models
- Countering adversarial images using input transformations
- Cramming: training a language model on a single GPU in one day
- Crosslingual generalization through multitask finetuning
- Curriculum learning
- Cutting down on prompts and parameters: simple few-shot learning with language models
- Data engineering for scaling language models to 128K context
- Deep Boltzmann machines
- Deep complex networks
- Deep learning
- Deep learning and the information bottleneck principle
- Deep learning techniques for super-resolution in video games
- Deep residual learning for image recognition
- Deep text classification can be fooled
- DeepSpeed compression: a composable library for extreme compression and zero-cost quantization
- DeepSpeed Inference: enabling efficient inference of transformer models at unprecedented scale
- DeepSpeed powers 8x larger MoE model training with high performance
- DeepSpeed Ulysses: system optimizations for enabling training of extreme long sequence transformer models
- DeepSpeed: accelerating large-scale model inference and training via system optimizations and compression
- DeepSpeed: advancing MoE inference and training to power next-generation AI scale
- Denoising distantly supervised open-domain question answering
- Diffusion convolutional recurrent neural network: data-driven traffic forecasting
- Discrete variational autoencoders
- Disentangling by factorising
- Disentangling language and knowledge in task-oriented dialogs
- Distributionally robust language modeling
- Editing models with task arithmetic
- Efficient estimation of word representations in vector space
- Efficient large scale language modeling with mixtures of experts
- Efficient large-scale language model training on GPU clusters using Megatron-LM
- Enhancing the reliability of out-of-distribution image detection in neural networks
- End-to-end task-oriented dialog modeling with semi-structured knowledge management
- Enhance reasoning for large language models in the game Werewolf
- Ensemble adversarial training: attacks and defenses
- Equilibrium propagation: bridging the gap between energy-based models and backpropagation
- Estimating or propagating gradients through stochastic neurons for conditional computation
- Exemplar encoder-decoder for neural conversation generation
- Expert human-level driving in Gran Turismo Sport using deep reinforcement learning with image-based representation
- Exploring deep recurrent models with reinforcement learning for molecule design
- Exploring the limits of transfer learning with a unified text-to-text transformer
- Extreme compression for pre-trained transformers made simple and efficient
- Fast abstractive summarization with reinforce-selected sentence rewriting
- Fast benchmarking of accuracy vs. training time with cyclic learning rates
- Fast transformer decoding: one write-head is all you need
- Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning
- FFJORD: Free-form continuous dynamics for scalable reversible generative models
- Finetuned language models are zero-shot learners
- Flash-decoding for long-context inference
- FlashAttention: fast and memory-efficient exact attention with IO-awareness
- FlashAttention: fast transformer training with long sequences
- Foundations of NLP explained visually: beam search, how it works
- FP8 formats for deep learning
- FP8-LM: training FP8 large language models
- Gemini: a family of highly capable multimodal models
- Gemma: open models based on Gemini research and technology
- Generating adversarial examples with adversarial networks
- Generating sentences from a continuous space
- Generation-augmented retrieval for open-domain question answering
- Generative adversarial nets
- Generative pretraining from pixels
- Genetic algorithms in search, optimization and machine learning
- GeoMAN: multi-level attention networks for geo-sensory time series prediction
- Getting the most out of the NVIDIA A100 GPU with Multi-Instance GPU
- GLaM: efficient scaling of language models with mixture-of-experts
- GLM-130B: an open bilingual pre-trained model
- GLU variants improve transformer
- Going deeper with convolutions
- GPT-4 architecture, infrastructure, training dataset, costs, vision, MoE
- GPT-NeoX-20B: an open-source autoregressive language model
- GQA: training generalized multi-query transformer models from multi-head checkpoints
- Gradient-based hyperparameter optimization through reversible learning
- Graph attention networks
- Grounding large language models in interactive environments with online reinforcement learning
- Hierarchical neural story generation
- Hindsight: posterior-guided training of retrievers for improved open-ended generation
- HotFlip: white-box adversarial examples for text classification
- How big should my language model be?
- How PyTorch 2.0 accelerates deep learning with operator fusion and CPU/GPU code-generation
- How should AI systems behave, and who should decide?
- How we sped up transformer inference 100x for 🤗 API customers
- How 🤗 Accelerate runs very large models thanks to PyTorch
- Hydragen: high-throughput LLM inference with shared prefixes
- HyKnow: end-to-end task-oriented dialog modeling with hybrid knowledge management
- Hyperparameter search with Transformers and Ray Tune
- Image-to-image translation with conditional generative adversarial networks
- ImageNet classification with deep convolutional neural networks
- Improving entity linking by modeling latent relations between mentions
- Improving language models by retrieving from trillions of tokens
- Improving language understanding by generative pre-training
- Improving reinforcement learning from human feedback with efficient reward model ensemble
- Incredibly fast BLOOM inference with DeepSpeed and Accelerate
- Inference suboptimality in variational autoencoders
- InfoGAN: interpretable representation learning by information maximizing generative adversarial nets
- Interpretable convolutional neural networks via feedforward design
- Introducing MPT-7B: a new standard for open-source, commercially usable LLMs
- Introducing nvFuser, a deep learning compiler for PyTorch
- Introducing Turing image super resolution: AI powered image enhancements for Microsoft Edge and Bing maps
- Introducing 🤗 accelerate
- Is ChatGPT 175 billion parameters? Technical analysis
- Is the future of neural networks Sparse? An introduction (1/N)
- Jack of all trades, master of some, a multi-purpose transformer agent
- Joint reasoning on hybrid-knowledge sources for task-oriented dialog
- Judging LLM-as-a-judge with MT-bench and chatbot arena
- Know what you don't know: unanswerable questions for SQuAD
- Knowledge-grounded dialogue generation with pre-trained language models
- Language is not all you need: aligning perception with language models
- Language modeling with gated convolutional networks
- Language modelling with pixels
- Language models (mostly) know what they know
- Language models are unsupervised multitask learners
- Language models as compilers: simulating pseudocode execution improves algorithmic reasoning in language models
- Large language models are not fair evaluators
- Layer normalization
- Learning activation functions to improve deep neural networks
- Learning associative inference using fast weight memory
- Learning discourse-level diversity for neural dialog models using conditional variational autoencoders
- Learning on a general network
- Learning representations by back-propagating errors
- Learning transferable visual models from natural language supervision
- Learning word embeddings efficiently with noise-contrastive estimation
- Leave no context behind: efficient infinite context transformers with infini-attention
- Lessons learned on language model safety and misuse
- Lifelong language pretraining with distribution-specialized experts
- Linear scaling made possible with weight streaming
- Linformer: self-attention with linear complexity
- LLM in a flash: efficient large language model inference with limited memory
- LLM.int8(): 8-bit matrix multiplication for transformers at scale
- Long sequence modeling with XGen: a 7B LLM trained on 8K input sequence length
- LoRA: Low-Rank Adaptation of large language models
- Lost in the middle: how language models use long contexts
- M6-10T: a sharing-delinking paradigm for efficient multi-trillion parameter pretraining
- Machine learning
- Machine learning: a probabilistic perspective
- Making deep learning go brrrr from first principles
- Making DeepSpeed ZeRO run efficiently on more-affordable hardware
- Mask & focus: conversation modelling by learning concepts
- Matryoshka representation learning
- Maximizing communication efficiency for large-scale training via 0/1 Adam
- MCR-DL: mix-and-match communication runtime for deep learning
- MegaBlocks: efficient sparse training with mixture-of-experts
- Megatron-LM: training multi-billion parameter language models using model parallelism
- Memory-efficient pipeline-parallel DNN training
- MinTL: minimalist transfer learning for task-oriented dialogue systems
- Mix and match: learning-free controllable text generation using energy language models
- Mixed precision training
- Mixture of attention heads: selecting attention heads per token
- Mixture-of-Experts meets instruction tuning: a winning combination for large language models
- mixup: beyond empirical risk minimization
- MMCoQA: conversational question answering over text, tables and images
- Mode matching in GANs through latent space learning and inversion
- Multi-level memory for task oriented dialogs
- Multitask prompt tuning enables parameter-efficient transfer learning
- MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling
- Mutual information neural estimation
- NeMo: a toolkit for building AI applications using neural modules
- Neural GPUs learn algorithms
- Neural network methods for natural language processing
- Neural networks and physical systems with emergent collective computational abilities
- Neural networks for pattern recognition
- Neural ordinary differential equations
- No train no gain: revisiting efficient training algorithms for transformer-based language models
- Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples
- OctoPack: instruction tuning code large language models
- On the convergence of Adam and beyond
- On the power of neural networks for solving hard problems
- One model to learn them all
- Open domain question answering over tables via dense retrieval
- Open question answering over tables and text
- OPT: open pre-trained transformer language models
- Optimal brain compression: a framework for accurate post-training quantization and pruning
- Optimal perceptual inference
- Optimization story: Bloom inference
- Orca 2: teaching small language models how to reason
- Orca: progressive learning from complex explanation traces of GPT-4
- Outer product-based neural collaborative filtering
- Outrageously large neural networks: the sparsely-gated mixture-of-experts layer
- Overcoming oscillations in quantization-aware training
- PAL: Program-aided language models
- PaLM: scaling language modeling with pathways
- Parallel context windows improve in-context learning of large language models
- Pattern classification
- Pattern recognition and machine learning
- Perceptual losses for real-time style transfer and super-resolution
- Personalizing dialogue agents: I have a dog, do you have pets too?
- Phase-functioned neural networks for character control
- Playing Atari with deep reinforcement learning
- Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing
- Prefix-tuning: optimizing continuous prompts for generation
- Probabilistic latent semantic analysis
- Progressive growing of GANs for improved quality, stability and variation
- Prompting with pseudo-code instructions
- Proximal policy optimization algorithms
- PullNet: open domain question answering with iterative retrieval on knowledge bases and text
- PyTorch trace analysis for the masses
- Q-BERT: Hessian based ultra low precision quantization of BERT
- R3Net: recurrent residual refinement network for saliency detection
- Reading Wikipedia to answer open-domain questions
- REALM: Retrieval-augmented language model pretraining
- Recurrent models of visual attention
- Reducing activation recomputation in large transformer models
- Regularizing and optimizing LSTM language models
- Reinforcement Learning: An Introduction
- ReLoRA: high-rank training through low-rank updates
- Restricted Boltzmann machines for collaborative filtering
- Retrieval augmentation reduces hallucination in conversation
- Retrieval-augmented generation for knowledge-intensive NLP tasks
- Revisiting classifier two-sample tests
- RoBERTa: a robustly optimized BERT pretraining approach
- RoFormer: enhanced transformer with rotary position embedding
- SantaCoder: don't reach for the stars!
- Scaling instruction-finetuned language models
- Scaling PyTorch FSDP for training foundation models on IBM Cloud
- Scaling transformer to 1M tokens and beyond with RMT
- Self-instruct: aligning language models with self-generated instructions
- Self-normalizing neural networks
- Semantically equivalent adversarial rules for debugging NLP models
- Seq2seq model and the exposure bias problem
- Sequence parallelism: long sequence training from system perspective
- Sequential latent knowledge selection for knowledge-grounded dialogue
- Simple and effective multi-paragraph reading comprehension
- Simplifying transformer blocks
- SlimPajama-DC: understanding data combinations for LLM training
- SmoothQuant: accurate and efficient post-training quantization for large language models
- Soft filter pruning for accelerating deep convolutional neural networks
- SOLAR 10.7B: scaling large language models with simple yet effective depth up-scaling
- SOLOIST: building task bots at scale with transfer learning and machine teaching
- Solving quantitative reasoning problems with language models
- Spatial temporal graph convolutional networks for skeleton-based action recognition
- Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting
- Spectral normalization for generative adversarial networks
- Speech and language processing
- StarCoder: may the source be with you!
- Sticking the landing: simple, lower-variance gradient estimators for variational inference
- StitchNet: composing neural networks from pre-trained fragments
- Stochastic hyperparameter optimization through hypernetworks
- Strategies for teaching layered networks classification tasks
- Structured prompting: scaling in-context learning to 1,000 examples
- Style transfer from non-parallel text by cross-alignment
- Subword regularization: improving neural network translation models with multiple subword candidates
- Supervised learning of probability distributions by neural networks
- Supporting efficient large model training on AMD Instinct™ GPUs with DeepSpeed
- Switch transformers: scaling to trillion parameter models with simple and efficient sparsity
- Synchronization in neural nets
- Synthetic data (almost) from scratch: generalized instruction tuning for language models
- Tackling the poor assumptions of Naive Bayes text classifiers
- Tensor programs V: tuning large neural networks via zero-shot hyperparameter transfer
- TextWorld: a learning environment for text-based games
- The best of both worlds: combining recent advances in neural machine translation
- The elements of statistical learning: data mining, inference and prediction
- The Flan collection: designing data and methods for effective instruction tuning
- The information bottleneck method
- The Pile: an 800GB dataset of diverse text for language modeling
- The power of scale for parameter-efficient prompt tuning
- The wisdom of hindsight makes language models better instruction followers
- Thermometer encoding: one hot way to resist adversarial examples
- To regularize or not to regularize? The bias variance trade-off in regularized AEs
- Towards crowdsourced training of large neural networks using decentralized mixture-of-experts
- Towards deep learning models resistant to adversarial attacks
- Towards evaluating the robustness of neural networks
- Train short, test long: Attention with linear biases enables input length extrapolation
- Training compute-optimal large language models
- Training language models to follow instructions with human feedback
- Transformer memory as a differentiable search index
- Transformer quality in linear time
- Transformer-XL: attentive language models beyond a fixed-length context
- Transformers explained visually (part 1): overview of functionality
- Transformers explained visually (part 2): how it works, step-by-step
- Transformers explained visually (part 3): multi-head attention, deep dive
- Turing-NLG: a 17-billion-parameter language model by Microsoft
- UL2: unifying language learning paradigms
- Understanding convolutional neural networks with a mathematical model
- Understanding disentangling in β-VAE
- Understanding the Open Pre-Trained Transformers (OPT) library
- Unit tests for stochastic optimization
- Universal language model fine-tuning for text classification
- Unlimiformer: long-range transformers with unlimited length input
- Unpaired image-to-image translation using cycle-consistent adversarial networks
- Unsupervised machine translation using monolingual corpora only
- Unsupervised representation learning by predicting image rotations
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, the world’s largest and most powerful generative language model
- Variational inference using implicit distributions
- Variational inference with latent space quantization for adversarial resilience
- Variational learning for unsupervised knowledge grounded dialogs
- Variational lossy autoencoder
- Vector-quantized input-contextualized soft prompts for natural language understanding
- VEEGAN: reducing mode collapse in GANs using implicit variational learning
- Very deep convolutional networks for large-scale image recognition
- Visual instruction tuning
- Visualizing data using t-SNE
- Wasserstein GAN
- wav2vec 2.0: a framework for self-supervised learning of speech representations
- WaveNet: a generative model for raw audio
- WebGPT: browser-assisted question-answering with human feedback
- What language model to train if you have one million GPU hours?
- Will GPT-4 run DOOM?
- Word translation without parallel data
- Yandex publishes YaLM 100B. It’s the largest GPT-like neural network in open source
- You only cache once: decoder-decoder architectures for language models
- You only look once: unified, real-time object detection
- ZeRO & DeepSpeed: new system optimizations enable training models with over 100 billion parameters
- ZeRO++: Extremely efficient collective communication for giant model training
- ZeRO-2 & DeepSpeed: shattering barriers of deep learning speed & scale
- ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning
- Zero-shot text-to-image generation
- ZeRO: memory optimizations toward training trillion parameter models
- ZeroQuant: efficient and affordable post-training quantization for large-scale transformers
- β-VAE: learning basic visual concepts with a constrained variational framework
- Accelerated computing with a reconfigurable dataflow architecture
- Computer architecture: a quantitative approach
- Computer organization and design ARM edition: the hardware software interface
- Flipping bits in memory without accessing them: an experimental study of DRAM disturbance errors
- Improving DRAM performance by parallelizing refreshes with accesses
- Memory performance attacks: denial of memory service in multi-core systems
- Memory scaling: a systems architecture perspective
- Millicode in an IBM zSeries processor
- MTIA v1: Meta's first-generation AI inference accelerator
- RAIDR: Retention-Aware Intelligent DRAM Refresh
- Stall-time fair memory access scheduling for chip multiprocessors
- Elements of information theory
- Error detecting and error correcting codes
- Convex Optimization
- Distributed optimization and statistical learning via the alternating direction method of multipliers
- A fast quantum mechanical algorithm for database search
- A single quantum cannot be cloned
- Can quantum-mechanical description of physical reality be considered complete?
- Image recognition with an adiabatic quantum computer I. mapping to quadratic unconstrained binary optimization
- Integer optimization toolbox (minimizing polynomials over integer lattices using quantum annealing)
- Limits on parallel speedup for classical Ising model solvers
- Partitioning optimization problems for hybrid classical/quantum execution
- Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer
- Probabilistic cloning and identification of linearly independent quantum states
- Programming with D-Wave: map coloring problem
- Quantum computation and quantum information
- Quantum computing: a gentle introduction
- Quantum performance evaluation: a short reading list
- Quantum theory, the Church-Turing principle and the universal quantum computer
- Rapid solution of problems by quantum computation
- Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels
- Discrete-time signal processing
- Foundations of Signal Processing
- Signals and systems
- Understanding digital signal processing