NLP Journey - Roadmap to Learn LLMs from Scratch with Modern NLP Methods in 2024

This repository provides a comprehensive guide for learning Natural Language Processing (NLP) from the ground up, progressing to the understanding and application of Large Language Models (LLMs). It focuses on practical skills needed for NLP and LLM-related roles in 2024 and beyond. We'll leverage Jupyter Notebooks for hands-on practice.

Chapter 1: Foundations of NLP

Core NLP Concepts

Topic	Resources
Introduction to NLP: Syntax, Semantics, Pragmatics, Discourse	What is Natural Language Processing (NLP)?

Text Preprocessing & Feature Engineering

Topic	Resources	Practices
Tokenization (Word, Subword - BPE, SentencePiece)	Hugging Face Tokenizers, Tokenization, Lemmatization, Stemming, and Sentence Segmentation, Andrej Karpathy: Let's build the GPT Tokenizer	[]
Stemming & Lemmatization	Stanford: Stemming and lemmatization, NLTK Stemming and Lemmatization	[]
Stop Word Removal, Punctuation Handling	NLTK Stop Words	[]
Bag-of-Words (BoW), TF-IDF, N-grams	Scikit-learn: Text Feature Extraction	[]
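
As a starting point for the Bag-of-Words and TF-IDF row above, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1`, with no length normalization. The helper names `tokenize` and `tf_idf` are illustrative and do not come from any library.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,!?;:\"'()").lower() for word in text.split())
    return [t for t in tokens if t]

def tf_idf(docs):
    """Return one {term: weight} dict per document (raw count * smoothed IDF)."""
    bags = [Counter(tokenize(doc)) for doc in docs]  # bag-of-words counts
    n_docs = len(docs)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each term counts once per document
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in bag.items()} for bag in bags]

docs = ["The cat sat.", "The dog sat.", "The cat ran!"]
for doc, weights in zip(docs, tf_idf(docs)):
    print(doc, {t: round(w, 3) for t, w in weights.items()})
```

Terms that appear in every document (here "the") receive the lowest weight, while terms unique to one document receive the highest.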

Word Embeddings

Topic	Resources	Practices
Word2Vec, GloVe, FastText	Jay Alammar - Illustrated Word2Vec, Gensim Word2Vec, Stanford GloVe	[]
Contextual Embeddings (ELMo, BERT)	Stanford NLP: N-gram Language Models	[]
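
Word embeddings are usually compared with cosine similarity. The sketch below shows the idea in pure Python. The three-dimensional vectors in `toy_embeddings` are made up for illustration and are not taken from Word2Vec, GloVe, or FastText.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, top_k=2):
    """Rank every other word by cosine similarity to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical toy vectors, hand-written for this example only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}
print(most_similar("king", toy_embeddings))
```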

Chapter 2: Essential NLP Tasks & Algorithms

Topic	Resources	Practices
Text Classification (Naive Bayes, SVM, Logistic Regression, Deep Learning)	Scikit-learn Text Classification, Hugging Face Text Classification, FastText	[]
Sentiment Analysis (Lexicon-Based, Machine Learning, Aspect-Based)	NLTK Sentiment Analysis, TextBlob Sentiment Analysis, VADER Sentiment Analysis	[]
Named Entity Recognition (NER) (NLTK, spaCy, Transformers)	NLTK NER, spaCy NER, Hugging Face NER, MIT Information Extraction Toolkit	[]
Text Clustering (K-Means, Hierarchical Clustering, DBSCAN, OPTICS)	Scikit-learn Clustering	[]
Topic Modeling (LDA, NMF)	Gensim Topic Modeling, Scikit-learn NMF, BigARTM	[]
Information Retrieval (TF-IDF, BM25, Query Expansion, Vector Search, Semantic Search)	Elasticsearch, Solr, Pinecone	[]
Question Answering	DrQA, Document-QA	[]
Knowledge Extraction	Template-Based Information Extraction without the Templates, Privee: An Architecture for Automatically Analyzing Web Privacy Policies, LEGALO	[]
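
For the BM25 entry in the Information Retrieval row, the following is a small sketch of the scoring function. It assumes whitespace tokenization, a non-empty corpus, the common defaults `k1=1.5` and `b=0.75`, and the `ln((N - n + 0.5) / (n + 0.5) + 1)` IDF variant. Production engines such as Elasticsearch handle analysis and indexing quite differently.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))  # document frequency per term

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock markets fell today",
]
print(bm25_scores("cat mat", corpus))
```

Note that "cats" does not match "cat" here. Stemming or lemmatization, covered in Chapter 1, is what closes that gap.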

NLP Applications

Topic	Resources	Practices
Dialogue Systems	ChatScript, ChatterBot, RiveScript, SuperScript, BotKit	[]
Machine Translation	Berkeley Aligner, cdec, Jane, Joshua, Moses, alignment-with-openfst, zmert	[]
Text Summarization	IndoSum, Cohere Summarize Beta	[]

Chapter 3: Deep Learning for NLP

Neural Network Fundamentals

Topic	Resources	Practices
Neural Network Basics, Backpropagation	3Blue1Brown - Neural Networks, freeCodeCamp - Deep Learning Crash Course	[]
Perceptron	3Blue1Brown - Neural Networks	[]

Deep Learning Frameworks

Topic	Resources	Practices
PyTorch, JAX, TensorFlow	PyTorch Tutorials, JAX Documentation, TensorFlow Tutorials, Caffe	[]
MXNet, NumPy	MXNet + NumPy	[]

Deep Learning Architectures for NLP

Topic	Resources	Practices
Recurrent Neural Networks (RNNs) (Sequence Modeling, LSTMs, GRUs, Attention)	colah's blog: Understanding LSTMs, Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks, Bayesian Recurrent Neural Network for Language Modeling, RNNLM, KALDI LSTM	[]
Convolutional Neural Networks (CNNs) for Text (Classification, Hierarchical CNNs)	Understanding Convolutional Neural Networks for NLP, Yoon Kim: Convolutional Neural Networks for Sentence Classification	[]
Sequence-to-Sequence Models (Attention, Transformers, T5, BART)	Jay Alammar: The Illustrated Transformer, Google AI Blog: Transformer Networks, Hugging Face: T5, Hugging Face: BART	[]

Chapter 4: Large Language Models (LLMs)

The Transformer Architecture

Topic	Resources	Practices
Attention, Residual Connections, Layer Normalization, RoPE	The Illustrated Transformer, The Illustrated GPT-2, Visual Intro to Transformers, LLM Visualization, nanoGPT, GPT in 60 Lines of NumPy	[]
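
The attention mechanism listed above can be sketched in a few lines of pure Python. This is a single-head, unbatched version of scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`, with an optional causal mask. It omits learned projections, multiple heads, and RoPE, and it assumes queries and keys are aligned by position when `causal=True`.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        if causal:
            # Position i may only attend to positions 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v, causal=True))
```

With the causal mask, the first position can only attend to itself, so its output is exactly the first value vector.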

LLM Architectures, Pre-training, & Post-training

Topic	Resources	Practices
GPT, BERT, T5, Llama, PaLM, Phi-3	LLMDataHub, Hugging Face: Causal Language Modeling, TinyLlama, Chinchilla's Wild Implications, BLOOM, OPT-175B Logbook, LLM360, New LLM Pre-training and Post-training Paradigms, Phi-3CookBook, Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, LLM Reading List, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, Jamba: A Hybrid Transformer-Mamba Language Model, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, The Llama 3 Herd of Models	[]
Emerging Architectures (DeepSeek-v2, Jamba, Mixture of Experts - MoE)	DeepSeek-v2, Jamba, Hugging Face: Mixture of Experts Explained, Create MoEs with MergeKit Notebook, GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity	[]

Fine-tuning & Adapting LLMs

Topic	Resources	Practices
Supervised Fine-Tuning (SFT)	Fine-Tune Your Own Llama 2 Model, Padding Large Language Models, A Beginner's Guide to LLM Fine-Tuning, unslothai, Fine-tune Llama 2 with QLoRA Notebook, Fine-tune CodeLlama using Axolotl Notebook, Fine-tune Mistral-7b with QLoRA Notebook, Fine-tune Mistral-7b with DPO Notebook, Fine-tune Llama 3 with ORPO Notebook, Fine-tune Llama 3.1 with Unsloth Notebook, flux-finetune, torchtune, Flan Collection: Designing Data and Methods for Effective Instruction Tuning	[]
Parameter-Efficient Fine-tuning (PEFT) (LoRA, Adapters, Prompt Tuning)	LoRA Insights, Hugging Face: Parameter-Efficient Fine-Tuning, FLAN, T0	[]
Reinforcement Learning from Human Feedback (RLHF) (PPO, DPO)	Distilabel, An Introduction to Training LLMs using RLHF, Hugging Face: Illustrating RLHF, Hugging Face: Preference Tuning LLMs, LLM Training: RLHF and Its Alternatives, Fine-tune Mistral-7b with DPO, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Training language models to follow instructions with human feedback, WebGPT: Browser-assisted question-answering with human feedback, Improving alignment of dialogue agents via targeted human judgements, OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization	[]
Model Merging (SLERP, DARE/TIES, FrankenMoEs)	Merge LLMs with mergekit, DARE/TIES, Phixtral, MergeKit, Merge LLMs with MergeKit Notebook, LazyMergekit Notebook	[]

LLM Evaluation

Topic	Resources	Practices
LLM Evaluation Benchmarks & Tools	lm-evaluation-harness, MixEval, lighteval, OLMO-eval, instruct-eval, simple-evals, Giskard, LangSmith, Ragas, Chatbot Arena Leaderboard, MixEval Leaderboard, AlpacaEval Leaderboard, Open LLM Leaderboard, OpenCompass 2.0 LLM Leaderboard, Berkeley Function-Calling Leaderboard, HELM, BIG-bench	[]

Prompt Engineering

Topic	Resources	Practices
Prompt Engineering Techniques (Zero-Shot, Few-Shot, Chain-of-Thought, ReAct)	Prompt Engineering Guide, Lilian Weng: Prompt Engineering, LLM Prompt Engineering Simplified Book, Chain-of-Thoughts Papers, Awesome Deliberative Prompting, Instruction-Tuning-Papers, Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Awesome ChatGPT Prompts, awesome-chatgpt-prompts-zh	[]
Task-Specific Prompting (e.g., Code Generation)	Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, Codex	[]
Structuring LLM Outputs (Templates, JSON, LMQL, Outlines, Guidance)	Chat Template, Outlines - Quickstart, LMQL - Overview, Microsoft Guidance, Guidance, Outlines	[]

Retrieval Augmented Generation (RAG)

Topic	Resources	Practices
Document Loaders, Text Splitters	LangChain Text Splitters, LlamaIndex Data Connectors	[]
Embedding Models	Sentence Transformers Library, MTEB Leaderboard, InferSent	[]
Vector Databases (Chroma, Pinecone, Milvus, FAISS, Annoy)	Chroma, Pinecone, Milvus, FAISS, Annoy	[]
Orchestrators (LangChain, LlamaIndex, FastRAG)	LangChain, LlamaIndex, FastRAG, 🦜🔗 Awesome LangChain	[]
Query Expansion, Re-ranking, HyDE	HyDE, LangChain Retrievers	[]
RAG Fusion	RAG-fusion	[]
Evaluation (Context Precision/Recall, Faithfulness, Relevancy, Ragas, DeepEval)	Ragas, DeepEval	[]
Query Construction (SQL, Cypher)	LangChain Query Construction, LangChain SQL	[]
Agents & Tools (Google Search, Wikipedia, Python, Jira)	LangChain Agents	[]
Programmatic LLMs (DSPy)	DSPy, dspy	[]
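
The text-splitting step at the start of a RAG pipeline can be illustrated with a fixed-size character splitter with overlap. This is a simplified sketch under assumed defaults (`chunk_size=200`, `overlap=50`). The function name `split_text` is hypothetical and is not the LangChain or LlamaIndex API, whose splitters also respect separators and token counts.

```python
def split_text(text, chunk_size=200, overlap=50):
    """Split `text` into character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

print(split_text("abcdefghij", chunk_size=4, overlap=2))
```

Overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.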

Chapter 5: Multimodal Learning & Applications

Topic	Resources	Practices
Multimodal LLMs (CLIP, ViT, LLaVA, MiniCPM-V, GPT-SoVITS)	OpenAI CLIP, Google AI Blog: ViT, LLaVA, MiniCPM-V 2.6, GPT-SoVITS	[]
Vision-Language Tasks (Image Captioning, VQA, Visual Reasoning)	Hugging Face: Vision-Language Tasks, Microsoft Kosmos-1, Google PaLM-E, Visual Instruction Tuning	[]
Text-to-Image Generation, Video Understanding	Stability AI: Stable Diffusion, OpenAI DALL-E 2, Hugging Face: Video Understanding, Deep-Live-Cam	[]
Emerging Trends (Neuro-Symbolic AI, LLMs for Robotics)		[]

Chapter 6: Deployment & Productionizing LLMs

Deployment Strategies

Topic	Resources	Practices
Local Servers (LM Studio, Ollama, Oobabooga, Kobold.cpp)	LM Studio, Ollama, oobabooga, kobold.cpp, llama.cpp, mistral.rs, Serge	[]
Cloud Deployment (AWS, GCP, Azure, SkyPilot, Specialized Hardware (TPUs))	SkyPilot, Hugging Face Inference API, Together AI, Modal, Metal	[]
Serverless Functions, Edge Deployment (MLC LLM, mnn-llm)	AWS Lambda, Google Cloud Functions, Azure Functions, MLC LLM, mnn-llm	[]
LLM Serving	LitServe, vLLM, TGI, FastChat, Jina, LangServe	[]

Inference Optimization

Topic	Resources	Practices
Quantization (GPTQ, EXL2, GGUF, llama.cpp, exllama)	Introduction to Quantization, Quantization with GGUF and llama.cpp Notebook, 4-bit LLM Quantization with GPTQ, 4-bit Quantization using GPTQ Notebook, ExLlamaV2: The Fastest Library to Run LLMs, ExLlamaV2 Notebook, AutoQuant Notebook, exllama	[]
Flash Attention, Key-Value Cache (MQA, GQA)	Flash-Attention, Multi-Query Attention, Grouped-Query Attention	[]
Knowledge Distillation, Pruning	Distilling the Knowledge in a Neural Network, To prune, or not to prune: exploring the efficacy of pruning for model compression	[]
Speculative Decoding	Hugging Face: Assisted Generation	[]
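
To make the idea behind quantization concrete, here is a sketch of symmetric absmax quantization to 8-bit integers. It is a simplified illustration and not the GPTQ, EXL2, or GGUF algorithm. Those formats use block-wise scales and more sophisticated error correction. The function names are illustrative.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0] * len(weights), 1.0  # all-zero input: any scale works
    scale = max_abs / 127
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.003]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
print(codes, scale)
print([round(abs(w - r), 5) for w, r in zip(weights, restored)])
```

The round-trip error per weight is bounded by half the scale, which is why a single outlier with a large magnitude degrades precision for every other weight that shares its scale.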

Building with LLMs

Topic	Resources	Practices
APIs (OpenAI, Google, Anthropic, Cohere, OpenRouter, Hugging Face)	OpenAI API, Google AI Platform, Anthropic API, Cohere API, OpenRouter, Hugging Face Inference API, GPTRouter	[]
Web Frameworks (Gradio, Streamlit)	Gradio, Streamlit, ZeroSpace Notebook	[]
User Interfaces, Chatbots	Chainlit, Langchain-Chatchat, llm-ui	[]
End-to-End LLM Projects	Awesome NLP Projects	[]
LLM Application Frameworks	LangChain, Haystack, Semantic Kernel, LlamaIndex, LMQL, ModelFusion, Flappy, LiteChain, magentic	[]

MLOps for LLMs

Topic	Resources	Practices
CI/CD, Monitoring, Model Management	CometLLM, MLflow, Kubeflow, Evidently, Arthur Shield, Mona, OpenLLMetry, Graphsignal, Arize-Phoenix	[]
Experiment Tracking, Model Versioning	Weights & Biases, MLflow Tracking	[]
Data & Model Pipelines	ZenML, DVC	[]

LLM Security

Topic	Resources	Practices
Prompt Hacking (Injection, Leaking, Jailbreaking)	OWASP LLM Top 10, Prompt Injection Primer, Awesome LLM Security	[]
Backdoors (Data Poisoning, Trigger Backdoors)	Trojaning Language Models for Fun and Profit, Hidden Trigger Backdoor Attacks	[]
Defensive Measures (Red Teaming, Garak, Langfuse)	Red Teaming LLMs, garak, Langfuse	[]

alirezasdb/NLP-Journey