This repository is a guide to learning Natural Language Processing (NLP) from the ground up, progressing to understanding and applying Large Language Models (LLMs). It focuses on the practical skills needed for NLP and LLM-related roles in 2024 and beyond, with Jupyter Notebooks for hands-on practice.

Topic | Resources |
---|---|
Introduction to NLP: Syntax, Semantics, Pragmatics, Discourse | What is Natural Language Processing (NLP)? |

Topic | Resources | Practices |
---|---|---|
Tokenization (Word, Subword - BPE, SentencePiece) | Hugging Face Tokenizers, Tokenization, Lemmatization, Stemming, and Sentence Segmentation, Andrej Karpathy: Let's build the GPT Tokenizer | [] |
Stemming & Lemmatization | Stanford: Stemming and lemmatization, NLTK Stemming and Lemmatization | [] |
Stop Word Removal, Punctuation Handling | NLTK Stop Words | [] |
Bag-of-Words (BoW), TF-IDF, N-grams | Scikit-learn: Text Feature Extraction | [] |
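
As a starting point for the Bag-of-Words, TF-IDF, and N-grams row above, here is a minimal sketch in plain Python. The whitespace tokenizer and the unsmoothed `log(N / df)` weighting are simplifying assumptions; scikit-learn's `TfidfVectorizer` smooths the IDF by default, so its numbers will differ.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf = count / document length, idf = log(N / df), with no smoothing.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog sat on the log"]
bow = [Counter(tokenize(d)) for d in docs]   # bag-of-words counts
weights = tf_idf(docs)                       # TF-IDF weights
bigrams = ngrams(tokenize(docs[0]), 2)       # word bigrams of the first doc
```

Terms that occur in every document (here "the", "sat", "on") get weight zero under this formula, which is the motivation for the smoothed variants used in practice.
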
Topic | Resources | Practices |
---|---|---|
Word2Vec, GloVe, FastText | Jay Alammar - Illustrated Word2Vec, Gensim Word2Vec, Stanford GloVe | [] |
Contextual Embeddings (ELMo, BERT) | Stanford NLP: N-gram Language Models | [] |
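
Embeddings from any of the models above are usually compared with cosine similarity. The sketch below shows that comparison on made-up 3-dimensional vectors; the numbers are hypothetical and chosen only for illustration, whereas real Word2Vec or GloVe vectors have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
    )

nearest = most_similar("king", vectors)
```
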
Topic | Resources | Practices |
---|---|---|
Text Classification (Naive Bayes, SVM, Logistic Regression, Deep Learning) | Scikit-learn Text Classification, Hugging Face Text Classification, FastText | [] |
Sentiment Analysis (Lexicon-Based, Machine Learning, Aspect-Based) | NLTK Sentiment Analysis, TextBlob Sentiment Analysis, VADER Sentiment Analysis | [] |
Named Entity Recognition (NER) (NLTK, spaCy, Transformers) | NLTK NER, spaCy NER, Hugging Face NER, MIT Information Extraction Toolkit | [] |
Text Clustering (K-Means, Hierarchical Clustering, DBSCAN, OPTICS) | Scikit-learn Clustering | [] |
Topic Modeling (LDA, NMF) | Gensim Topic Modeling, Scikit-learn NMF, BigARTM | [] |
Information Retrieval (TF-IDF, BM25, Query Expansion, Vector Search, Semantic Search) | Elasticsearch, Solr, Pinecone | [] |
Question Answering | DrQA, Document-QA | [] |
Knowledge Extraction | Template-Based Information Extraction without the Templates, Privee: An Architecture for Automatically Analyzing Web Privacy Policies, LEGALO | [] |
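
For the Information Retrieval row above, the following is a small sketch of BM25 scoring in plain Python. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`; search engines may differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency per term
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "transformers use attention",
    "bm25 ranks documents by term overlap",
    "attention is all you need",
]
scores = bm25_scores("attention transformers", docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top document
```
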
Topic | Resources | Practices |
---|---|---|
Dialogue Systems | ChatScript, ChatterBot, RiveScript, SuperScript, BotKit | [] |
Machine Translation | Berkeley Aligner, cdec, Jane, Joshua, Moses, alignment-with-openfst, zmert | [] |
Text Summarization | IndoSum, Cohere Summarize Beta | [] |

Topic | Resources | Practices |
---|---|---|
Neural Network Basics, Backpropagation | 3Blue1Brown - Neural Networks, freeCodeCamp - Deep Learning Crash Course | [] |
Perceptron | 3Blue1Brown - Neural Networks | [] |
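
To make the Perceptron row concrete, here is a minimal sketch of the classic perceptron learning rule trained on the logical AND function. The learning rate, epoch count, and step activation (output 1 when the weighted sum is strictly positive) are illustrative choices, not taken from the linked resources.

```python
def predict(w, b, x):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, epochs=10, lr=1.0):
    """Train weights and bias with the perceptron update rule."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = y - predict(w, b, x)  # -1, 0, or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Truth table for logical AND, which is linearly separable.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]
```

A single perceptron can only learn linearly separable functions, so the same code will not converge on XOR; that limitation is what motivates multi-layer networks and backpropagation.
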
Topic | Resources | Practices |
---|---|---|
PyTorch, JAX, TensorFlow | PyTorch Tutorials, JAX Documentation, TensorFlow Tutorials, Caffe | [] |
MXNet, NumPy | MXNet + NumPy | [] |

Topic | Resources | Practices |
---|---|---|
Recurrent Neural Networks (RNNs) (Sequence Modeling, LSTMs, GRUs, Attention) | colah's blog: Understanding LSTMs, Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks, Bayesian Recurrent Neural Network for Language Modeling, RNNLM, KALDI LSTM | [] |
Convolutional Neural Networks (CNNs) for Text (Classification, Hierarchical CNNs) | Understanding Convolutional Neural Networks for NLP, Yoon Kim: Convolutional Neural Networks for Sentence Classification | [] |
Sequence-to-Sequence Models (Attention, Transformers, T5, BART) | Jay Alammar: The Illustrated Transformer, Google AI Blog: Transformer Networks, Hugging Face: T5, Hugging Face: BART | [] |

Topic | Resources | Practices |
---|---|---|
Attention, Residual Connections, Layer Normalization, RoPE | The Illustrated Transformer, The Illustrated GPT-2, Visual Intro to Transformers, LLM Visualization, nanoGPT, GPT in 60 Lines of NumPy | [] |
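
The attention mechanism listed above can be sketched in a few lines of plain Python. This is single-head scaled dot-product attention with an optional causal mask; it leaves out the learned projections, multiple heads, residual connections, layer normalization, and RoPE, and the helper names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]          # two token vectors, used as Q, K, and V
out = attention(x, x, x, causal=True)  # self-attention with a causal mask
```

With the causal mask, the first position can only attend to itself, so its output equals its own value vector.
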
Topic | Resources | Practices |
---|---|---|
LLM Evaluation Benchmarks & Tools | lm-evaluation-harness, MixEval, lighteval, OLMo-Eval, instruct-eval, simple-evals, Giskard, LangSmith, Ragas, Chatbot Arena Leaderboard, MixEval Leaderboard, AlpacaEval Leaderboard, Open LLM Leaderboard, OpenCompass 2.0 LLM Leaderboard, Berkeley Function-Calling Leaderboard, HELM, BIG-bench | [] |

Topic | Resources | Practices |
---|---|---|
Prompt Engineering Techniques (Zero-Shot, Few-Shot, Chain-of-Thought, ReAct) | Prompt Engineering Guide, Lilian Weng: Prompt Engineering, LLM Prompt Engineering Simplified Book, Chain-of-Thoughts Papers, Awesome Deliberative Prompting, Instruction-Tuning-Papers, Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Awesome ChatGPT Prompts, awesome-chatgpt-prompts-zh | [] |
Task-Specific Prompting (e.g., Code Generation) | Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, Codex | [] |
Structuring LLM Outputs (Templates, JSON, LMQL, Outlines, Guidance) | Chat Template, Outlines - Quickstart, LMQL - Overview, Microsoft Guidance, Guidance, Outlines | [] |
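
Few-shot prompting and structured outputs, both listed above, can be practiced without calling any model. The sketch below builds a few-shot classification prompt and validates a JSON reply; the `Text:`/`Label:` layout and the helper names are assumptions made for illustration, not a format required by any library.

```python
import json

def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and the new query."""
    parts = [instruction, ""]
    for text, label in examples:
        parts.append(f"Text: {text}")
        parts.append(f"Label: {label}")
        parts.append("")
    parts.append(f"Text: {query}")
    parts.append("Label:")  # the model is expected to continue from here
    return "\n".join(parts)

def parse_json_reply(reply, required_keys):
    """Return the reply as a dict, or None if it is not valid JSON
    or lacks one of the required keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not set(required_keys).issubset(data):
        return None
    return data

prompt = build_few_shot_prompt(
    "Classify the sentiment of each text as positive or negative.",
    [("I loved this film.", "positive"), ("Terrible service.", "negative")],
    "The food was great.",
)
parsed = parse_json_reply('{"label": "positive"}', {"label"})
rejected = parse_json_reply("Sure! The label is positive.", {"label"})
```

Libraries such as Outlines, LMQL, and Guidance go further by constraining generation itself, so that invalid output is not produced in the first place.
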
Topic | Resources | Practices |
---|---|---|
Document Loaders, Text Splitters | LangChain Text Splitters, LlamaIndex Data Connectors | [] |
Embedding Models | Sentence Transformers Library, MTEB Leaderboard, InferSent | [] |
Vector Databases (Chroma, Pinecone, Milvus, FAISS, Annoy) | Chroma, Pinecone, Milvus, FAISS, Annoy | [] |
Orchestrators (LangChain, LlamaIndex, FastRAG) | LangChain, LlamaIndex, FastRAG, 🦜🔗 Awesome LangChain | [] |
Query Expansion, Re-ranking, HyDE | HyDE, LangChain Retrievers | [] |
RAG Fusion | RAG-fusion | [] |
Evaluation (Context Precision/Recall, Faithfulness, Relevancy, Ragas, DeepEval) | Ragas, DeepEval | [] |
Query Construction (SQL, Cypher) | LangChain Query Construction, LangChain SQL | [] |
Agents & Tools (Google Search, Wikipedia, Python, Jira) | LangChain Agents | [] |
Programmatic LLMs (DSPy) | DSPy, dspy | [] |
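
RAG Fusion, listed above, is commonly described as running several query variants and merging their result lists with reciprocal rank fusion (RRF). The sketch below shows only the fusion step, assuming the constant `k=60` that is widely used for RRF; the document IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)),
    with ranks starting at 1. Returns document IDs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical result lists from three rewritten versions of one query.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
```

Documents that rank well across several lists rise to the top even if no single list ranks them first, which is why the method needs no score calibration between retrievers.
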
Topic | Resources | Practices |
---|---|---|
Multimodal LLMs (CLIP, ViT, LLaVA, MiniCPM-V, GPT-SoVITS) | OpenAI CLIP, Google AI Blog: ViT, LLaVA, MiniCPM-V 2.6, GPT-SoVITS | [] |
Vision-Language Tasks (Image Captioning, VQA, Visual Reasoning) | Hugging Face: Vision-Language Tasks, Microsoft Kosmos-1, Google PaLM-E, Visual Instruction Tuning | [] |
Text-to-Image Generation, Video Understanding | Stability AI: Stable Diffusion, OpenAI DALL-E 2, Hugging Face: Video Understanding, Deep-Live-Cam | [] |
Emerging Trends (Neuro-Symbolic AI, LLMs for Robotics) | | [] |

Topic | Resources | Practices |
---|---|---|
Local Servers (LM Studio, Ollama, Oobabooga, Kobold.cpp) | LM Studio, Ollama, oobabooga, kobold.cpp, llama.cpp, mistral.rs, Serge | [] |
Cloud Deployment (AWS, GCP, Azure, SkyPilot, Specialized Hardware (TPUs)) | SkyPilot, Hugging Face Inference API, Together AI, Modal, Metal | [] |
Serverless Functions, Edge Deployment (MLC LLM, mnn-llm) | AWS Lambda, Google Cloud Functions, Azure Functions, MLC LLM, mnn-llm | [] |
LLM Serving | LitServe, vLLM, TGI, FastChat, Jina, LangServe | [] |

Topic | Resources | Practices |
---|---|---|
Quantization (GPTQ, EXL2, GGUF, llama.cpp, exllama) | Introduction to Quantization, Quantization with GGUF and llama.cpp Notebook, 4-bit LLM Quantization with GPTQ, 4-bit Quantization using GPTQ Notebook, ExLlamaV2: The Fastest Library to Run LLMs, ExLlamaV2 Notebook, AutoQuant Notebook, exllama | [] |
Flash Attention, Key-Value Cache (MQA, GQA) | Flash-Attention, Multi-Query Attention, Grouped-Query Attention | [] |
Knowledge Distillation, Pruning | Distilling the Knowledge in a Neural Network, To prune, or not to prune: exploring the efficacy of pruning for model compression | [] |
Speculative Decoding | Hugging Face: Assisted Generation | [] |
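
The basic idea behind the Quantization row above can be illustrated with symmetric absmax quantization to 8-bit integers. This is a deliberately naive round-to-nearest sketch with one scale per tensor; GPTQ, EXL2, and GGUF use more elaborate schemes (grouping, calibration, mixed bit widths) that are not shown here.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error per weight is bounded by half the scale, so a single large outlier weight increases the error for every other weight in the tensor; that is one reason practical methods quantize in small groups.
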
Topic | Resources | Practices |
---|---|---|
APIs (OpenAI, Google, Anthropic, Cohere, OpenRouter, Hugging Face) | OpenAI API, Google AI Platform, Anthropic API, Cohere API, OpenRouter, Hugging Face Inference API, GPTRouter | [] |
Web Frameworks (Gradio, Streamlit) | Gradio, Streamlit, ZeroSpace Notebook | [] |
User Interfaces, Chatbots | Chainlit, Langchain-Chatchat, llm-ui | [] |
End-to-End LLM Projects | Awesome NLP Projects | [] |
LLM Application Frameworks | LangChain, Haystack, Semantic Kernel, LlamaIndex, LMQL, ModelFusion, Flappy, LiteChain, magentic | [] |

Topic | Resources | Practices |
---|---|---|
CI/CD, Monitoring, Model Management | CometLLM, MLflow, Kubeflow, Evidently, Arthur Shield, Mona, Openllmetry, Graphsignal, Arize-Phoenix | [] |
Experiment Tracking, Model Versioning | Weights & Biases, MLflow Tracking | [] |
Data & Model Pipelines | ZenML, DVC | [] |

Topic | Resources | Practices |
---|---|---|
Prompt Hacking (Injection, Leaking, Jailbreaking) | OWASP LLM Top 10, Prompt Injection Primer, Awesome LLM Security | [] |
Backdoors (Data Poisoning, Trigger Backdoors) | Trojaning Language Models for Fun and Profit, Hidden Trigger Backdoor Attacks | [] |
Defensive Measures (Red Teaming, Garak, Langfuse) | Red Teaming LLMs, garak, Langfuse | [] |