NLP

https://github.com/mlabonne/llm-course

Decoding strategies (Greedy Search, Beam Search, Top-k sampling, Nucleus sampling): https://mlabonne.github.io/blog/posts/2023-06-07-Decoding_strategies.html
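
A minimal sketch of the four decoding strategies from the post above, using the Hugging Face transformers generate API (the model name and prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The future of NLP is", return_tensors="pt")

# Greedy search: pick the single most probable token at every step.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Beam search: keep the num_beams most probable partial sequences.
beam = model.generate(**inputs, max_new_tokens=30, num_beams=5, do_sample=False)

# Top-k sampling: sample only from the k most probable tokens.
top_k = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)

# Nucleus (top-p) sampling: sample from the smallest token set whose
# cumulative probability exceeds p.
top_p = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9, top_k=0)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```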

Llama:

https://arxiv.org/pdf/2307.09288.pdf: Llama 2: Open Foundation and Fine-Tuned Chat Models

  • Fine-tuning with PEFT

https://medium.com/@ogbanugot/notes-on-fine-tuning-llama-2-using-qlora-a-detailed-breakdown-370be42ccca1 Notes on fine-tuning Llama 2 using QLoRA: A detailed breakdown

https://trojrobert.medium.com/4-easier-ways-for-fine-tuning-llama-2-and-other-open-source-llms-eb3218657f6e 4 easier ways for fine-tuning Llama 2 and other open-source LLMs
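
A minimal LoRA fine-tuning sketch with the PEFT library; the base model name and target modules are assumptions typical for Llama-style architectures, not taken from the articles above:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```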

QLORA:

https://arxiv.org/pdf/2305.14314.pdf: QLORA: Efficient Finetuning of Quantized LLMs

  • Computation Data Type: Usually BFloat16

  • Low-Precision Storage Data Type: 4-bit

  • Reduces memory usage enough to fine-tune a 65B model on a single 48GB GPU while maintaining full 16-bit fine-tuning task performance.

  • NF4 (4-bit NormalFloat): Optimal for normally distributed weights.

  • Double quantization reduces average memory footprint.

  • Paged optimizers manage memory spikes.

    Paged optimizers: prevent memory spikes when an abruptly long input arrives, especially when working with a single GPU. Fix: the optimizer state (e.g., Adam) is moved from GPU memory to CPU memory until the long sequence has been processed; once GPU memory is free again, the optimizer state is moved back to the GPU. Paged optimizers are part of the bitsandbytes library and can be enabled with the is_paged flag of the Optimizer8bit class (see the sketch below).
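
A sketch of the QLoRA settings listed above, using Hugging Face transformers with bitsandbytes: 4-bit NF4 storage, double quantization, BFloat16 compute, and a paged AdamW optimizer. The model name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # low-precision storage data type
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation data type
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Paged AdamW spills optimizer state to CPU memory when the GPU fills up.
training_args = TrainingArguments(output_dir="out", optim="paged_adamw_32bit")
```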

RLHF: PPO vs DPO

https://arxiv.org/abs/2305.18290: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

https://www.youtube.com/watch?v=QXVCqtAZAn4&ab_channel=DeepLearningAI: Aligning LLMs with Direct Preference Optimization
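
The DPO objective reduces to a logistic loss on the log-probability margin of the chosen response over the rejected one, measured relative to a frozen reference model. A rough PyTorch sketch (function and argument names are mine, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. reference model for chosen and rejected answers.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin, scaled by beta, pushed through a logistic loss.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```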

LLMLingua (Prompt Compression):

  • Natural language is redundant; the amount of information varies across tokens.

  • LLMs can understand compressed prompts.

  • There is a trade-off between language completeness and compression ratio. (LLMLingua)

  • GPT-4 can recover all the key information from a compressed prompt, an emergent ability. (LLMLingua)

  • The density and position of key information in a prompt affect the performance of downstream tasks. (LongLLMLingua)

https://arxiv.org/abs/2310.05736 LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

https://arxiv.org/abs/2310.06839 LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

https://github.com/run-llama/llama_index/blob/main/llama_index/postprocessor/longllmlingua.py LongLLMLingua is available as a postprocessor in LlamaIndex, a widely used RAG framework.
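
A minimal usage sketch with the llmlingua package's PromptCompressor; the prompt, question, and token budget are placeholders (see the LLMLingua repo for the full set of options):

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # defaults to a small LLaMA-based compressor model

long_prompt = "..."  # your long context / few-shot examples
compressed = compressor.compress_prompt(
    long_prompt,
    instruction="Answer the question based on the context.",
    question="What is prompt compression?",
    target_token=200,  # rough token budget after compression
)
print(compressed["compressed_prompt"])
```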

Comprehensive LLM Analysis Papers

https://arxiv.org/pdf/2402.06196.pdf Large Language Models: A Survey

Embedding

https://arxiv.org/abs/2205.13147: Matryoshka Representation Learning

  • After further testing, the most exciting feature (for us) is that the 256-dimensional version of text-embedding-3-large can outperform the 1536-dimensional Ada 002. That is a 6x reduction in vector size.

  • OpenAI confirmed (after some prodding) that they achieved this via Matryoshka Representation Learning.

  • MRL encodes information at different embedding dimensionalities. As per the paper, this enables up to 14x smaller embedding sizes with negligible degradation in accuracy.

  • MTEB: This benchmark covers 56 different tasks, including retrieval, classification, re-ranking, clustering, summarization, and more. Depending on your goals, you can look at the precise subset of tasks representing your use case.

  • BEIR: This benchmark focuses on the retrieval task and adds complexity in the form of different types and domains of questions, such as fact-checking, biomedical questions, or detecting duplicate questions. MTEB is largely a superset of the BEIR benchmark.
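
A sketch of the Matryoshka idea described above: truncate a full embedding to its first k dimensions and re-normalize, then compare with cosine similarity. The vectors here are random placeholders standing in for real model outputs (3072 is the text-embedding-3-large dimensionality):

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, k: int) -> np.ndarray:
    shortened = embedding[:k]  # keep the first k "coarse" dimensions
    return shortened / np.linalg.norm(shortened)

full_a, full_b = np.random.randn(3072), np.random.randn(3072)  # placeholder embeddings
a256 = truncate_and_normalize(full_a, 256)
b256 = truncate_and_normalize(full_b, 256)
cosine_sim = float(a256 @ b256)  # cosine similarity on the 256-d vectors
```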

https://arxiv.org/pdf/1909.10351.pdf: TinyBERT: Distilling BERT for Natural Language Understanding

https://arxiv.org/pdf/1910.01108.pdf: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

https://medium.com/towards-data-science/extract-knowledge-from-text-end-to-end-information-extraction-pipeline-with-spacy-and-neo4j-502b2b1e0754 Extract knowledge from text: end-to-end information extraction pipeline with spaCy and Neo4j

https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

https://arxiv.org/abs/2203.12533 Pathways: Asynchronous Distributed Dataflow for ML

https://arxiv.org/abs/2204.02311 PaLM: Scaling Language Modeling with Pathways

https://arxiv.org/abs/1906.08237 XLNet: Generalized Autoregressive Pretraining for Language Understanding