/NLP-Journey

Embarking on my NLP journey! This repo tracks my progress with code, projects, and notes. Join me as I explore data, models, and applications. Let's learn together!

NLP Journey - Roadmap to Learn LLMs from Scratch with Modern NLP Methods in 2024

This repository provides a comprehensive guide for learning Natural Language Processing (NLP) from the ground up, progressing to the understanding and application of Large Language Models (LLMs). It focuses on practical skills needed for NLP and LLM-related roles in 2024 and beyond. We'll leverage Jupyter Notebooks for hands-on practice.

Chapter 1: Foundations of NLP

Core NLP Concepts

| Topic | Resources |
| --- | --- |
| Introduction to NLP: Syntax, Semantics, Pragmatics, Discourse | What is Natural Language Processing (NLP)? |

Text Preprocessing & Feature Engineering

| Topic | Resources | Practices |
| --- | --- | --- |
| Tokenization (Word, Subword - BPE, SentencePiece) | Hugging Face Tokenizers, Tokenization, Lemmatization, Stemming, and Sentence Segmentation, Andrej Karpathy: Let's build the GPT Tokenizer | [] |
| Stemming & Lemmatization | Stanford: Stemming and lemmatization, NLTK Stemming and Lemmatization | [] |
| Stop Word Removal, Punctuation Handling | NLTK Stop Words | [] |
| Bag-of-Words (BoW), TF-IDF, N-grams | Scikit-learn: Text Feature Extraction | [] |
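As a starting point for the practice column, here is a minimal sketch of bag-of-words counting and TF-IDF weighting using only the Python standard library. The helper names (`tokenize`, `tf_idf`) and the toy documents are made up for illustration. It uses the plain `log(N / df)` form of IDF; libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters/digits; a deliberately naive tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # this Counter is the bag-of-words vector
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tf_idf(docs)

# Print the terms of the first document, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

A term that occurs in every document ("the" here) gets weight zero, while a term unique to one document ("mat") outranks one shared by two ("cat").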

Word Embeddings

| Topic | Resources | Practices |
| --- | --- | --- |
| Word2Vec, GloVe, FastText | Jay Alammar - Illustrated Word2Vec, Gensim Word2Vec, Stanford GloVe | [] |
| Contextual Embeddings (ELMo, BERT) | Stanford NLP: N-gram Language Models | [] |

Chapter 2: Essential NLP Tasks & Algorithms

| Topic | Resources | Practices |
| --- | --- | --- |
| Text Classification (Naive Bayes, SVM, Logistic Regression, Deep Learning) | Scikit-learn Text Classification, Hugging Face Text Classification, FastText | [] |
| Sentiment Analysis (Lexicon-Based, Machine Learning, Aspect-Based) | NLTK Sentiment Analysis, TextBlob Sentiment Analysis, VADER Sentiment Analysis | [] |
| Named Entity Recognition (NER) (NLTK, spaCy, Transformers) | NLTK NER, spaCy NER, Hugging Face NER, MIT Information Extraction Toolkit | [] |
| Text Clustering (K-Means, Hierarchical Clustering, DBSCAN, OPTICS) | Scikit-learn Clustering | [] |
| Topic Modeling (LDA, NMF) | Gensim Topic Modeling, Scikit-learn NMF, BigARTM | [] |
| Information Retrieval (TF-IDF, BM25, Query Expansion, Vector Search, Semantic Search) | Elasticsearch, Solr, Pinecone | [] |
| Question Answering | DrQA, Document-QA | [] |
| Knowledge Extraction | Template-Based Information Extraction without the Templates, Privee: An Architecture for Automatically Analyzing Web Privacy Policies, LEGALO | [] |

NLP Applications

| Topic | Resources | Practices |
| --- | --- | --- |
| Dialogue Systems | ChatScript, ChatterBot, RiveScript, SuperScript, BotKit | [] |
| Machine Translation | Berkeley Aligner, cdec, Jane, Joshua, Moses, alignment-with-openfst, zmert | [] |
| Text Summarization | IndoSum, Cohere Summarize Beta | [] |

Chapter 3: Deep Learning for NLP

Neural Network Fundamentals

| Topic | Resources | Practices |
| --- | --- | --- |
| Neural Network Basics, Backpropagation | 3Blue1Brown - Neural Networks, freeCodeCamp - Deep Learning Crash Course | [] |
| Perceptron | 3Blue1Brown - Neural Networks | [] |

Deep Learning Frameworks

| Topic | Resources | Practices |
| --- | --- | --- |
| PyTorch, JAX, TensorFlow | PyTorch Tutorials, JAX Documentation, TensorFlow Tutorials, Caffe | [] |
| MXNet, NumPy | MXNet + NumPy | [] |

Deep Learning Architectures for NLP

| Topic | Resources | Practices |
| --- | --- | --- |
| Recurrent Neural Networks (RNNs) (Sequence Modeling, LSTMs, GRUs, Attention) | colah's blog: Understanding LSTMs, Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks, Bayesian Recurrent Neural Network for Language Modeling, RNNLM, KALDI LSTM | [] |
| Convolutional Neural Networks (CNNs) for Text (Classification, Hierarchical CNNs) | Understanding Convolutional Neural Networks for NLP, Yoon Kim: Convolutional Neural Networks for Sentence Classification | [] |
| Sequence-to-Sequence Models (Attention, Transformers, T5, BART) | Jay Alammar: The Illustrated Transformer, Google AI Blog: Transformer Networks, Hugging Face: T5, Hugging Face: BART | [] |

Chapter 4: Large Language Models (LLMs)

The Transformer Architecture

| Topic | Resources | Practices |
| --- | --- | --- |
| Attention, Residual Connections, Layer Normalization, RoPE | The Illustrated Transformer, The Illustrated GPT-2, Visual Intro to Transformers, LLM Visualization, nanoGPT, GPT in 60 Lines of NumPy | [] |
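To make the attention topic concrete before diving into the resources, here is a minimal sketch of single-head scaled dot-product attention in plain Python, with an optional causal mask as used in GPT-style decoders. It is an illustration only: the function names and toy matrices are invented here, and it leaves out learned projections, multiple heads, residual connections, layer normalization, and RoPE.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of row vectors.

    Q, K: one vector of size d_k per position; V: one value vector per position.
    With causal=True, position i may only attend to positions j <= i.
    """
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

full_out = attention(Q, K, V)
causal_out = attention(Q, K, V, causal=True)
print(full_out)
print(causal_out)
```

With the causal mask on, the first position can only see itself, so its output is exactly the first value vector.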

LLM Architectures, Pre-training, & Post-training

| Topic | Resources | Practices |
| --- | --- | --- |
| GPT, BERT, T5, Llama, PaLM, Phi-3 | LLMDataHub, Hugging Face: Causal Language Modeling, TinyLlama, Chinchilla's Wild Implications, BLOOM, OPT-175B Logbook, LLM 360, New LLM Pre-training and Post-training Paradigms, Phi-3CookBook, Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, LLM Reading List, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, Jamba: A Hybrid Transformer-Mamba Language Model, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, The Llama 3 Herd of Models | [] |
| Emerging Architectures (DeepSeek-v2, Jamba, Mixture of Experts - MoE) | DeepSeek-v2, Jamba, Hugging Face: Mixture of Experts Explained, Create MoEs with MergeKit Notebook, GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | [] |

Fine-tuning & Adapting LLMs

| Topic | Resources | Practices |
| --- | --- | --- |
| Supervised Fine-Tuning (SFT) | Fine-Tune Your Own Llama 2 Model, Padding Large Language Models, A Beginner's Guide to LLM Fine-Tuning, unslothai, Fine-tune Llama 2 with QLoRA Notebook, Fine-tune CodeLlama using Axolotl Notebook, Fine-tune Mistral-7b with QLoRA Notebook, Fine-tune Mistral-7b with DPO Notebook, Fine-tune Llama 3 with ORPO Notebook, Fine-tune Llama 3.1 with Unsloth Notebook, flux-finetune, torchtune, Flan Collection: Designing Data and Methods for Effective Instruction Tuning | [] |
| Parameter-Efficient Fine-tuning (PEFT) (LoRA, Adapters, Prompt Tuning) | LoRA Insights, Hugging Face: Parameter-Efficient Fine-Tuning, FLAN, T0 | [] |
| Reinforcement Learning from Human Feedback (RLHF) (PPO, DPO) | Distilabel, An Introduction to Training LLMs using RLHF, Hugging Face: Illustrating RLHF, Hugging Face: Preference Tuning LLMs, LLM Training: RLHF and Its Alternatives, Fine-tune Mistral-7b with DPO, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Training language models to follow instructions with human feedback, WebGPT: Browser-assisted question-answering with human feedback, Improving alignment of dialogue agents via targeted human judgements, OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | [] |
| Model Merging (SLERP, DARE/TIES, FrankenMoEs) | Merge LLMs with mergekit, DARE/TIES, Phixtral, MergeKit, Merge LLMs with MergeKit Notebook, LazyMergekit Notebook | [] |

LLM Evaluation

| Topic | Resources | Practices |
| --- | --- | --- |
| LLM Evaluation Benchmarks & Tools | lm-evaluation-harness, MixEval, lighteval, OLMO-eval, instruct-eval, simple-evals, Giskard, LangSmith, Ragas, Chatbot Arena Leaderboard, MixEval Leaderboard, AlpacaEval Leaderboard, Open LLM Leaderboard, OpenCompass 2.0 LLM Leaderboard, Berkeley Function-Calling Leaderboard, HELM, BIG-bench | [] |

Prompt Engineering

| Topic | Resources | Practices |
| --- | --- | --- |
| Prompt Engineering Techniques (Zero-Shot, Few-Shot, Chain-of-Thought, ReAct) | Prompt Engineering Guide, Lilian Weng: Prompt Engineering, LLM Prompt Engineering Simplified Book, Chain-of-Thoughts Papers, Awesome Deliberative Prompting, Instruction-Tuning-Papers, Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Awesome ChatGPT Prompts, awesome-chatgpt-prompts-zh | [] |
| Task-Specific Prompting (e.g., Code Generation) | Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, Codex | [] |
| Structuring LLM Outputs (Templates, JSON, LMQL, Outlines, Guidance) | Chat Template, Outlines - Quickstart, LMQL - Overview, Microsoft Guidance, Guidance, Outlines | [] |

Retrieval Augmented Generation (RAG)

| Topic | Resources | Practices |
| --- | --- | --- |
| Document Loaders, Text Splitters | LangChain Text Splitters, LlamaIndex Data Connectors | [] |
| Embedding Models | Sentence Transformers Library, MTEB Leaderboard, InferSent | [] |
| Vector Databases (Chroma, Pinecone, Milvus, FAISS, Annoy) | Chroma, Pinecone, Milvus, FAISS, Annoy | [] |
| Orchestrators (LangChain, LlamaIndex, FastRAG) | LangChain, LlamaIndex, FastRAG, 🦜🔗 Awesome LangChain | [] |
| Query Expansion, Re-ranking, HyDE | HyDE, LangChain Retrievers | [] |
| RAG Fusion | RAG-fusion | [] |
| Evaluation (Context Precision/Recall, Faithfulness, Relevancy, Ragas, DeepEval) | Ragas, DeepEval | [] |
| Query Construction (SQL, Cypher) | LangChain Query Construction, LangChain SQL | [] |
| Agents & Tools (Google Search, Wikipedia, Python, Jira) | LangChain Agents | [] |
| Programmatic LLMs (DSPy) | DSPy, dspy | [] |
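The retrieval half of a RAG pipeline can be sketched without any of the libraries above. The example below is a toy, not how LangChain or LlamaIndex implement it: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names and sample chunks are hypothetical. It ranks chunks by cosine similarity to the query and assembles the prompt that would then be sent to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: sparse bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    # Rank all chunks by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    # Put the retrieved chunks in front of the question.
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


chunks = [
    "FAISS is a library for similarity search over dense vectors.",
    "Gradio builds quick web demos for machine learning models.",
    "Chroma is an open-source vector database for embeddings.",
]
query = "Which vector database stores embeddings?"

top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Swapping `embed` for a sentence-embedding model and the list for a vector store keeps the same overall shape: embed, rank, then build the prompt.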

Chapter 5: Multimodal Learning & Applications

| Topic | Resources | Practices |
| --- | --- | --- |
| Multimodal LLMs (CLIP, ViT, LLaVA, MiniCPM-V, GPT-SoVITS) | OpenAI CLIP, Google AI Blog: ViT, LLaVA, MiniCPM-V 2.6, GPT-SoVITS | [] |
| Vision-Language Tasks (Image Captioning, VQA, Visual Reasoning) | Hugging Face: Vision-Language Tasks, Microsoft Kosmos-1, Google PaLM-E, Visual Instruction Tuning | [] |
| Text-to-Image Generation, Video Understanding | Stability AI: Stable Diffusion, OpenAI DALL-E 2, Hugging Face: Video Understanding, Deep-Live-Cam | [] |
| Emerging Trends (Neuro-Symbolic AI, LLMs for Robotics) | | [] |

Chapter 6: Deployment & Productionizing LLMs

Deployment Strategies

| Topic | Resources | Practices |
| --- | --- | --- |
| Local Servers (LM Studio, Ollama, Oobabooga, Kobold.cpp) | LM Studio, Ollama, oobabooga, kobold.cpp, llama.cpp, mistral.rs, Serge | [] |
| Cloud Deployment (AWS, GCP, Azure, SkyPilot, Specialized Hardware (TPUs)) | SkyPilot, Hugging Face Inference API, Together AI, Modal, Metal | [] |
| Serverless Functions, Edge Deployment (MLC LLM, mnn-llm) | AWS Lambda, Google Cloud Functions, Azure Functions, MLC LLM, mnn-llm | [] |
| LLM Serving | LitServe, vLLM, TGI, FastChat, Jina, LangServe | [] |

Inference Optimization

| Topic | Resources | Practices |
| --- | --- | --- |
| Quantization (GPTQ, EXL2, GGUF, llama.cpp, exllama) | Introduction to Quantization, Quantization with GGUF and llama.cpp Notebook, 4-bit LLM Quantization with GPTQ, 4-bit Quantization using GPTQ Notebook, ExLlamaV2: The Fastest Library to Run LLMs, ExLlamaV2 Notebook, AutoQuant Notebook, exllama | [] |
| Flash Attention, Key-Value Cache (MQA, GQA) | Flash-Attention, Multi-Query Attention, Grouped-Query Attention | [] |
| Knowledge Distillation, Pruning | Distilling the Knowledge in a Neural Network, To prune, or not to prune: exploring the efficacy of pruning for model compression | [] |
| Speculative Decoding | Hugging Face: Assisted Generation | [] |
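For intuition on the quantization row, here is a minimal sketch of symmetric absmax (round-to-nearest) int8 quantization on a small list of weights. It shows only the basic idea of mapping floats to a small integer range with a shared scale. Methods such as GPTQ or the GGUF formats add per-group scales, error compensation, and packed storage, none of which is shown. The function names and sample weights are made up for illustration.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    # Recover approximate float values from the integers.
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(scale)
print(max_error)
```

The round-trip error per weight is bounded by half the scale, which is why a single large outlier weight (which inflates the scale) hurts precision for every other weight in the group.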

Building with LLMs

| Topic | Resources | Practices |
| --- | --- | --- |
| APIs (OpenAI, Google, Anthropic, Cohere, OpenRouter, Hugging Face) | OpenAI API, Google AI Platform, Anthropic API, Cohere API, OpenRouter, Hugging Face Inference API, GPTRouter | [] |
| Web Frameworks (Gradio, Streamlit) | Gradio, Streamlit, ZeroSpace Notebook | [] |
| User Interfaces, Chatbots | Chainlit, Langchain-Chatchat, llm-ui | [] |
| End-to-End LLM Projects | Awesome NLP Projects | [] |
| LLM Application Frameworks | LangChain, Haystack, Semantic Kernel, LlamaIndex, LMQL, ModelFusion, Flappy, LiteChain, magentic | [] |

MLOps for LLMs

| Topic | Resources | Practices |
| --- | --- | --- |
| CI/CD, Monitoring, Model Management | CometLLM, MLflow, Kubeflow, Evidently, Arthur Shield, Mona, Openllmetry, Graphsignal, Arize-Phoenix | [] |
| Experiment Tracking, Model Versioning | Weights & Biases, MLflow Tracking | [] |
| Data & Model Pipelines | ZenML, DVC | [] |

LLM Security

| Topic | Resources | Practices |
| --- | --- | --- |
| Prompt Hacking (Injection, Leaking, Jailbreaking) | OWASP LLM Top 10, Prompt Injection Primer, Awesome LLM Security | [] |
| Backdoors (Data Poisoning, Trigger Backdoors) | Trojaning Language Models for Fun and Profit, Hidden Trigger Backdoor Attacks | [] |
| Defensive Measures (Red Teaming, Garak, Langfuse) | Red Teaming LLMs, garak, Langfuse | [] |