Master list of curated resources on NLP and LLMs

Master NLP and LLM Resource List

This is the master resource list for NLP from scratch. This is a living document and will continually be updated and so should always be considered a work in progress. If you find any dead links or other issues, feel free to submit an issue.

This document is quite large, so you may wish to use the Table of Contents automatically generated by GitHub to find what you are looking for.

Thanks, and enjoy!

Traditional NLP

Datasets

  • nlp-datasets: Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
  • awesome-public-datasets - Natural Language: Natural language section of the awesome public datasets github page
  • SMS Spam Dataset: The “Hello World” of NLP datasets, ~5.5K SMS messages labeled spam/not spam for binary classification. Hosted on the UC Irvine Machine Learning Repository.
  • IMDB dataset: The other “Hello World” of datasets for NLP, 50K “highly polar” movie reviews scraped from IMDB and compiled by Andrew Maas of Stanford.
  • Twitter Airline Sentiment: Tweets about major US airlines from February 2015, with associated sentiment labels - hosted on Kaggle (~3.5MB)
  • CivilComments: Dataset from the Civil Comments platform, which shut down in 2017. 2M public comments with labels for toxicity, obscenity, threat, insult, etc.
  • Cornell Movie Dialog: ~220K conversations from 10K pairs of characters across 617 popular movies, compiled by Cristian Danescu-Niculescu-Mizil of Cornell. Tabular compiled format available on Hugging Face.
  • CNN Daily Mail: “Hello World” dataset for summarization, consisting of articles from CNN and Daily Mail and accompanying summaries. Also available through TensorFlow and via Hugging Face.
  • Entity Recognition Datasets: Very large list of named entity recognition (NER) datasets (on Github).
  • WikiNER: 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Polish, Italian, Spanish, Dutch, Portuguese and Russian.
  • OntoNotes: Large corpus comprising various genres of text in three languages with structural information and shallow semantic information.
  • Flores-101: Multilingual, multi-task dataset from Meta for machine translation research, focusing on “low resource” languages. Associated Github repo.
  • CulturaX: Open dataset of 167 languages with over 6T tokens, billed as the largest multilingual dataset released to date.
  • Amazon Review Datasets: Massive datasets of reviews from Amazon.com, compiled by Julian McAuley of University of California San Diego
  • Yelp Open Dataset: 7M reviews, 210K businesses, and 200K images released by Yelp. Note the educational license.
  • Google Books N-grams: Very large dataset (2.2TB) of all the n-grams from Google Books. Also available hosted in an S3 bucket by AWS.
  • Sentiment Analysis @ Stanford NLP: Includes a link to the dataset of movie reviews used for Stanford Sentiment Treebank 2 (SST2). Also available on Hugging Face.
  • CoNLL-2003: Language-independent entity recognition dataset from the Conference on Computational Natural Language Learning (CoNLL-2003) shared task. Foundational dataset for named entity recognition (NER).
  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset: Large-scale dataset of 1M conversations with LLMs, collected from the Chatbot Arena website.
  • TabLib: Largest publicly available dataset of tabular tokens (627M tables, 867B tokens), to encourage the community to build Large Data Models that better understand tabular data
  • LAION 5B: Massive dataset of images and captions from Large-scale Artificial Intelligence Open Network (LAION), used to train Stable Diffusion.
  • Databricks Dolly 15K: Instruction dataset compiled internally by Databricks, used to train the Dolly models based on the Pythia LLMs.

Data Acquisition

Libraries

  • Natural Language Toolkit (NLTK): Core and essential NLP Python library put together for teaching purposes by the University of Pennsylvania, now fundamental to NLP work.
  • spaCy: Fundamental Python NLP library for “industrial-strength natural language processing”, focused on building production systems.
  • Gensim: Open-source Python library with a focus on topic modeling, semantic similarity, and embeddings. Also contains implementations of word2vec and doc2vec.
  • fastText: Open-source, free, lightweight library that allows users to learn text representations (embeddings) and text classifiers. Includes pre-trained word vectors from Wikipedia and Common Crawl. From Meta’s FAIR Group.
  • KerasNLP: Natural language processing with deep learning and LLMs in Keras using TensorFlow, PyTorch, or JAX. Includes models such as BERT, GPT, and OPT.
  • TensorFlow Text: Lower level than KerasNLP, text manipulation built into TensorFlow.
  • Stanford CoreNLP: Java-based NLP library from Stanford, still important and in use
  • TextBlob: Easy to use NLP library in Python, including simple sentiment scoring and part-of-speech (POS) tagging.
  • Scikit-learn (sklearn): The essential library for doing machine learning in Python, listed here specifically for its tools for working with text data.
  • SparkNLP: Essential Big Data library for NLP work from John Snow Labs. Take a look at their extensive model repo. Github repo with lots of resources here. Medium post here on using the T5 model for classification with SparkNLP.

Neural Networks / Deep Learning

Sentiment Analysis

Optical Character Recognition (OCR)

Information Extraction and NERD

  • RAKE: Rapid Automatic Keyword Extraction, a domain independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
  • YAKE: Yet Another Keyword Extractor is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text.
  • Pytextrank: Python implementation of TextRank and associated algorithms as a spaCy pipeline extension, for information extraction and extractive summarization.
  • PKE (Python Keyphrase Extraction): Open-source Python-based keyphrase extraction toolkit, implementing a variety of algorithms. Uses spaCy.
  • KeyBERT: Keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
  • UniversalNER: Targeted distillation model for named entity recognition from Microsoft Research and USC, based on data generated by ChatGPT.
  • SpanMarker: Framework for NER models based on transformers such as BERT, RoBERTa and ELECTRA using Hugging Face Transformers (HF page)
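The RAKE entry above describes scoring key phrases by word frequency and co-occurrence. A minimal sketch of that idea follows, assuming a tiny hand-picked stopword list and the common degree/frequency word score. It is an illustration only, not the code of any RAKE package listed here; the names `STOPWORDS`, `candidate_phrases`, and `rake_scores` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE setups use far larger lists).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
             "it", "its", "of", "on", "or", "the", "to", "with", "which"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9\-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word appears
    degree = defaultdict(int)  # co-occurrence count within phrases (including the word itself)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    sample = ("Keyword extraction is the task of finding key phrases in a body of text. "
              "Rapid automatic keyword extraction is domain independent.")
    for phrase, score in rake_scores(sample)[:3]:
        print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that co-occur with many others score highest, which is why this scheme tends to favor multi-word key phrases over single frequent words.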

Semantics and Syntax

  • Treebank: Definition at Wikipedia
  • Universal Dependencies: Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.
  • UDPipe: UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files.

Topic Modeling & Embedding

Multilingual NLP and Machine Translation

Natural Language Inference (NLI) and Natural Language Understanding (NLU)

  • Adversarial NLI: Benchmark for NLI from Meta research and associated dataset.

Interviewing

Large Language Models (LLMs) and Gen AI

Introductory LLMs

Foundation Models

Text Generation

Summarization

Fine-tuning LLMs

Model Quantization

Data Labeling

  • Label Studio: Open-source Python library / framework for data labelling

LLM Development

  • GPT4All: Locally-hosted LLM from Nomic for offline development.
  • LM Studio: Software framework for local LLM development and usage.
  • OpenAI Cookbook: Recipes and tutorial posts for working and building with OpenAI, all in one place. Example code in the Github repo.
  • SuperWhisper: Local usage of the Whisper model on macOS, allows you to speak commands to your machine and have them transcribed (all locally).
  • Cursor: Locally installable code editor with autocomplete, chat, etc. backed by OpenAI GPT-3.5/4.

Multimodal LLMs

Images

Audio

Video and Animation

  • Generative Image Dynamics: Model from researchers at Google for creating looping images or interactive images from still ones.
  • IDEFICS: Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS. Open multimodal text and image model from Hugging Face based on Flamingo (https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model), similar to GPT-4V.
  • NeRF: Neural Radiance Fields synthesize novel views of a scene from a set of input images.
  • ZipNeRF: Building on NeRF with more advanced techniques and impressive results, generating drone-style “fly-by” videos from still images of settings.
  • Pegasus-1: Multimodal model from TwelveLabs for describing videos and video-to-text generation.
  • Gen-2 by RunwayML: Video-generating multimodal model from Runway ML that takes text or images as input.
  • Replay: Video (animated picture) generating model from Genmo AI
  • Hotshot XL: Text to animated GIF generator based on Stable Diffusion XL. Github and Hugging Face model page.
  • ModelScope: Open model for text-to-video generation from Alibaba research

Other Multimodal LLM Applications

  • DreamBooth3D: Approach for generating high-quality custom 3D models from source images.
  • MVDream: 3D model generation from Diffusion from researchers at ByteDance.
  • TADA! Text to Animatable Digital Avatars: Research on models for synthetic generation of 3D avatars from text prompts, from researchers in China and Germany
  • GATO: Generalist agent from Google DeepMind research for many tasks and media types
  • Tome: Startup for AI-generated slides (PowerPoint)

Domain-specific LLMs

Code

  • GitHub Copilot: GitHub’s AI coding assistant, based on OpenAI’s Codex model.
  • GitHub Copilot Fundamentals - Understand the AI pair programmer: Introductory online training / short course on Copilot from Microsoft.
  • CodeCompose: (TechCrunch article): Meta’s internal coding LLM / answer to Copilot
  • CodeInterpreter: Experimental ChatGPT plugin that gives it the ability to execute Python code.
  • StableCode: Stability AI’s generative LLM coding model. Hugging Face collection here. Github here.
  • Starcoder: Coding LLM from Hugging Face. Github is here.
  • Ghostwriter: an AI-powered programming assistant from Replit AI.
  • DeciCoder 1B: Code completion LLM from Deci AI, trained on Starcoder dataset.
  • SQLCoder: Open text-to-SQL query models fine-tuned on Starcoder, from Defog AI. Demo is here.
  • Code Llama: Fine-tuned version of Llama 2 for coding tasks, from Meta.
  • Tabby: Open source, locally-hosted coding assistant framework. Can use Starcoder or Code Llama.
  • DuetAI for Developers: Coding assistance based on PaLM as part of Google’s DuetAI offering.
  • Gorilla LLM: LLM model from researchers at UC Berkeley trained to generate API calls across many different platforms and tools.

Mathematics

Finance

  • BloombergGPT: LLM trained by Bloomberg from scratch based on code / approaches from BLOOM
  • FinGPT: Finance-specific family of models trained with RLHF, fine-tuned from various base foundation models.

Science and Health

  • Galactica: (MIT Blog Post) Learnings from Meta’s Galactica LLM, trained on scientific research papers.
  • BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, open LLM from Microsoft Research trained on PubMed papers.
  • Med-PaLM: A large language model from Google Research, designed for the medical domain.

Vector Databases and Frameworks

  • Docarray: Python library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, and so on.
  • Faiss: Library for efficient similarity search and clustering of dense vectors from Meta Research.
  • Pinecone: Vector database that offers high-performance search and similarity matching.
  • Weaviate: Open-source vector database to store data objects and vector embeddings from your favorite ML-models.
  • Chroma: Open-source vector store used for storing and retrieving vector embeddings and metadata for use with large language models.
  • Milvus: Vector database built for scalable similarity search.
  • AstraDB: Datastax’s vector database offering built atop Apache Cassandra.
  • Activeloop: Database for AI powered by a unique storage format optimized for deep-learning and Large Language Model (LLM) based applications.
  • OSS Chat: Demo of RAG from Zilliz, allowing chat with OSS documentation.
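The tools above all revolve around similarity search over embedding vectors. As a rough illustration of the underlying operation, the sketch below does a brute-force cosine-similarity lookup in plain Python. The vectors and document IDs in `TOY_INDEX` are made up, and `cosine_similarity` and `top_k` are hypothetical helpers, not the API of any listed library. Production systems replace the linear scan with approximate nearest-neighbor indexes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-dimensional "embeddings" for illustration; real embeddings have
# hundreds or thousands of dimensions and come from a model.
TOY_INDEX = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_stocks": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for doc_id, score in top_k([1.0, 0.0, 0.0], TOY_INDEX):
        print(f"{doc_id}: {score:.3f}")
```

The linear scan costs one comparison per stored vector, which is the cost that dedicated vector indexes are designed to avoid at scale.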

Evaluation

  • The Stanford Natural Language Inference (SNLI) Corpus: Foundational dataset for NLI-based evaluation, 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral.
  • GLUE: General Language Understanding Evaluation Benchmark from NYU, University of Washington, and Google - model evaluation using Natural Language Inference (NLI) tasks.
  • SuperGLUE: The Super General Language Understanding Evaluation, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
  • SQuAD (Stanford Question Answering Dataset): Reading comprehension question answering dataset for LLM evaluation.
  • BigBench: The Beyond the Imitation Game Benchmark (BIG-bench) from Google, a collaborative benchmark with over 200 tasks.
  • BigBench Hard: Subset of BigBench tasks considered to be the most challenging, with associated paper.
  • MMLU: Massive Multitask Language Understanding is a benchmark developed by researchers at UC Berkeley and others to specifically measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.
  • HELM: Holistic Evaluation of Language Models, a “living” benchmark designed to be comprehensive, from the Center for Research on Foundation Models (CRFM) at Stanford.
  • HellaSwag: A challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
  • Dynabench: A “platform for dynamic data collection and benchmarking”. Sort of a Kaggle / collaborative site for benchmarks and data collaboration, an effort of researchers from Meta and American universities.
  • Hugging Face Open LLM Leaderboard: Leaderboard from H4 (alignment) Group at Hugging Face. Largely open and fine-tuned models, though this can be filtered.
  • OpenCompass: Leaderboard for Chinese LLMs.
  • Evaluating LLMs is a minefield: Popular deck from researchers at Princeton (and authors of AI Snake Oil) on the pitfalls and intricacies of evaluating LLMs.
  • LM Contamination Index: The LM Contamination Index is a manually created database of contamination of LLM evaluation benchmarks.
  • The Curious Case of LLM Evaluation: In depth blog post, examining some of the finer nuances and sticking points of evaluating LLMs.
  • LLM Benchmarks: Dynamic dataset of crowd-sourced prompts that changes weekly for more realistic LLM evaluation.
  • Language Model Evaluation Harness: EleutherAI’s language model evaluation harness, a unified framework to test generative language models on over 200 different evaluation tasks.

Agents

  • AutoGPT: One of the most popular frameworks for using LLM agents, using the OpenAI API / GPT4.
  • ThinkGPT: Python library for implementing Chain of Thoughts for LLMs, prompting the model to think, reason, and to create generative agents.
  • AutoGen: Multi-agent LLM framework for building applications from Microsoft.
  • XAgent: Open-source experimental agent, designed to be general-purpose and applicable to a wide range of tasks. From students at Tsinghua University.
  • Thought Cloning: Github repo for implementation of Thought Cloning (TC), an imitation learning framework that trains agents to think like humans.
  • Demonstrate-Search-Predict (DSP): framework for solving advanced tasks with language models (LMs) and retrieval models (RMs).
  • ReAct Framework: Prompting method that includes examples with actions, the observations gained by taking those actions, and transcribed thoughts (reasoning), so that LLMs can take complex actions and reason through or solve problems.
  • Tree of Thoughts (ToT): LLM reasoning process as a tree, where each node is an intermediate "thought" or coherent piece of reasoning that serves as a step towards the final solution.
  • GPT Engineer: Python framework for attempting to get GPT to write code and build software.
  • MetaGPT - The Multi-Agent Framework: Agent framework where different assigned roles (product managers, architects, project managers, engineers) are used for building different products (user stories, competitive analysis, requirements, data structures, etc.) given a requirement.

Application Frameworks

  • LlamaIndex: LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Used for RAG and building LLM applications working with stored data.
  • LangChain: LangChain is a framework for developing applications powered by language models.
  • Chainlit: Chainlit is an open-source Python package that makes it incredibly fast to build ChatGPT-like applications with your own business logic and data.

LLM Training, Training Frameworks, Training at Scale

  • Deepspeed: Deep learning optimization software suite that enables unprecedented scale and speed for DL Training and Inference from Microsoft.
  • Megatron-LM: From NVIDIA, Megatron-LM enables pre-training of large transformer language models with efficient tensor, pipeline, and sequence-based model parallelism.
  • GPT-NeoX: EleutherAI’s library for large-scale GPU training of LLMs, based on Megatron.
  • TRL (Transformer Reinforcement Learning): Library for reinforcement learning of transformer and Stable Diffusion models, built atop the transformers library.
  • Autotrain Advanced: In-development offering and Python library from Hugging Face for easy and fast auto-training of LLMs and Stable Diffusion models.
  • Transformer Math: Detailed blog post from EleutherAI on the mathematics of compute requirements for training LLMs

Reinforcement Learning from Human Feedback (RLHF)

Embeddings

LLM Serving

Preprocessing and Tokenization

  • Tiktoken: OpenAI’s BPE-based tokenizer
  • SentencePiece: Unsupervised text tokenizer and detokenizer for text generation systems from Google (but not an official product).
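Both tokenizers above build on subword vocabularies, and Tiktoken is described as BPE-based. The toy sketch below shows how byte pair encoding learns merges: it repeatedly fuses the most frequent adjacent symbol pair. It works on characters of a made-up corpus for readability (byte-level BPE tokenizers apply the same idea to bytes) and is not the implementation of either library. `get_pair_counts`, `merge_pair`, and `learn_bpe` are hypothetical names.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

if __name__ == "__main__":
    merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
    print(merges)
    print(vocab)
```

Frequent words end up as single tokens while rare words stay split into smaller pieces, which is what keeps the vocabulary bounded without an out-of-vocabulary token.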

Open LLMs

  • GPT-J and GPT-NeoX: Open models trained from scratch by EleutherAI.
  • Falcon 40B: Open text generation LLM from UAE’s Technology Innovation Institute (TII). Available on Hugging Face here.
  • Minotaur 15B: Fine-tuned version of Starcoder on open code datasets from the OpenAccess AI Collective
  • Mistral 7B: Popular open model from French startup Mistral with no fine-tuning (only pretraining).
  • MPT: Family of open models free for commercial use from MosaicML. Includes MPT Storywriter which has a 65K context window.
  • Qwen: Open LLM models from Alibaba Cloud in 7B and 14B sizes, including chat versions.
  • Fuyu-8B: Open multimodal model from Adept AI, a smaller version of the model that powers their commercial product.
  • ML Foundations: Github repo for Ludwig Schmidt from University of Washington, includes open versions of multimodal models Flamingo & CLIP

Visualization

Prompt Engineering

Ethics, Bias, and Legal

Costing

Books, Courses and other Resources

Communities

  • MLOps Community: Community of machine learning operations (MLOps) practitioners, but lately very much focused on LLMs.
  • LLMOps Space: global community for LLM practitioners & enthusiasts, focused on topics related to deploying LLMs into production
  • Aggregate Intellect Socratic Circles (AISC): Online community of ML and AI practitioners based in Toronto, with Slack server, journal club, and free talks
  • /r/LanguageTechnology: Reddit community on Natural Language Processing and LLMs with over 40K members
  • /r/LocalLLaMA: Subreddit to discuss training Llama and development around it, though also contains a lot of good general LLM discussion.

MOOCs and Courses

Books

Surveys

Aggregators and Online Resources

Papers (WIP)

Conferences and Societies