The-NLP-Pandect: A Python repository from mnrclab

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

Compendiums and awesome lists on the topic of NLP:

The NLP Index by Quantum Stat / NLP Cypher
Awesome NLP by keon [GitHub, 12112 stars]
Speech and Natural Language Processing Awesome List by elaboshira [GitHub, 2024 stars]
Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 939 stars]
Text Mining and Natural Language Processing Resources by stepthom [GitHub, 388 stars]
Made with ML List by madewithml.com
Brainsources for #NLP enthusiasts by Philip Vollet
Awesome AI/ML/DL - NLP Section [GitHub, 908 stars]
Resources on various machine learning topics by Backprop

NLP Conferences, Paper Summaries and Paper Compendiums:

Non-English resources and compendiums

NLP Resources for Bahasa Indonesian [GitHub, 188 stars]
Indic NLP Catalog [GitHub, 267 stars]
Pre-trained language models for Vietnamese [GitHub, 377 stars]
Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 719 stars]
Indic NLP Library [GitHub, 366 stars]
AI4Bharat-IndicNLP Portal
ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 181 stars]
zemberek-nlp - NLP tools for Turkish [GitHub, 935 stars]
KLUE - Korean Language Understanding Evaluation [GitHub, 271 stars]

Pre-trained NLP models

List of pre-trained NLP models [GitHub, 133 stars]

NLP Year in Review

2020

Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
ML and NLP Research Highlights of 2020 [Blog, January 2021]

NLP-only podcasts

NLP Highlights [Years: 2017 - now, Status: active]

Many NLP episodes

TWIML AI [Years: 2016 - now, Status: active]
Practical AI [Years: 2018 - now, Status: active]
The Data Exchange [Years: 2019 - now, Status: active]
Gradient Dissent [Years: 2020 - now, Status: active]
Machine Learning Street Talk [Years: 2020 - now, Status: active]

Some NLP episodes

The Super Data Science Podcast [Years: 2016 - now, Status: active]
Data Hack Radio [Years: 2018 - now, Status: active]
AI Game Changers [Years: 2020 - now, Status: active]
The Analytics Show [Years: 2019 - now, Status: active]

General NLU

GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
RACE - ReAding Comprehension dataset collected from English Examinations
dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking

Summarization

WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset

Question Answering

SQuAD - Stanford Question Answering Dataset (SQuAD)
XQuAD - Cross-lingual Question Answering Dataset (XQuAD)
GrailQA - Strongly Generalizable Question Answering (GrailQA)
CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

XTREME - Massively Multilingual Multi-task Benchmark
GLUECoS - A benchmark for code-switched NLP
IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
LinCE - Linguistic Code-Switching Evaluation Benchmark
Russian SuperGLUE - Russian SuperGLUE Benchmark

Bio, Law, and other scientific domains

BLURB - Biomedical Language Understanding and Reasoning Benchmark
BLUE - Biomedical Language Understanding Evaluation benchmark

Transformer Efficiency

Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 301 stars]

Speech Processing

SUPERB - Speech processing Universal PERformance Benchmark

Other

CodeXGLUE - A benchmark dataset for code intelligence
CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
MultiNLI - Multi-Genre Natural Language Inference corpus

General

A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]

Embeddings

Repositories

Pre-trained ELMo Representations for Many Languages [GitHub, 1337 stars]
sense2vec - Contextually-keyed word vectors [GitHub, 1261 stars]
wikipedia2vec [GitHub, 703 stars]
StarSpace [GitHub, 3627 stars]
fastText [GitHub, 22749 stars]

Blogs

Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
The Illustrated Word2vec by Jay Alammar [Blog, 2019]

Cross-lingual Word and Sentence Embeddings

vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 553 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 5498 stars]

Byte Pair Encoding

bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 969 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1735 stars]
python-bpe - Byte Pair Encoding for Python [GitHub, 147 stars]
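
For readers new to the technique, the merge-learning step behind Byte Pair Encoding can be sketched in a few lines of plain Python. This is a toy illustration only, not the implementation used by any of the repositories above; the function name `learn_bpe` and the `</w>` end-of-word marker are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Each word becomes a tuple of symbols plus an end-of-word marker (assumed: "</w>").
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe(["ab", "ab", "abc"], num_merges=2))
```

Production tokenizers add details this sketch leaves out, such as pre-tokenization, byte-level fallbacks, and faster pair counting.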

Transformer-based Architectures

General

The Transformer Family by Lilian Weng [Blog, 2020]
Keeping up with the BERTs: a review of the main NLP benchmarks by Manuel Tonneau [Blog, 2020]
Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
Attention? Attention! by Lilian Weng [Blog, 2018]
the transformer … “explained”? [Blog, 2019]
Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
Understanding and Applying Self-Attention for NLP [Talk, 2018]
The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
Pre-Trained Models: Past, Present and Future [Paper, June 2021]
A Survey of Transformers [Paper, June 2021]

Transformer

The Annotated Transformer by Harvard NLP [Blog, 2018]
The Illustrated Transformer by Jay Alammar [Blog, 2018]
Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
Reformer: The Efficient Transformer [Blog, 2020]
Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
TRANSFORMERS FROM SCRATCH [Blog, 2019]
Universal Transformers by Mostafa Dehghani [Blog, 2019]
Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 440 stars]

BERT

A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
Understanding searches better than ever before [Blog, 2019]
Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 223 stars]
BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 340 stars]
Optimal Subarchitecture Extraction for BERT [GitHub, 431 stars]
CharacterBERT: Reconciling ELMo and BERT [GitHub, 113 stars]
When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]

Other Transformer Variants

T5

T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
T5: the Text-To-Text Transfer Transformer [Blog, 2020]
multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 676 stars]

BigBird

Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]

Reformer / Linformer / Longformer / Performers

Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 646 stars]

Switch Transformer

Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]

GPT-family

General

The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
The Annotated GPT-2 by Aman Arora
OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
How to generate text by Patrick von Platen [Blog, 2020]

GPT-3

Learning Resources

Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
Is it possible for language models to achieve language understanding? by Christopher Potts

Applications

Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3219 stars]
GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
OpenAI API - API Demo to use GPT-3 for commercial applications

Open-source Efforts

GPT-Neo - in-progress GPT-3 open source replication
GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile

Other

What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
Turing NLG by Microsoft
Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
ELECTRA [GitHub, 1827 stars]
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 646 stars]

Distillation, Pruning and Quantization

Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
David over Goliath: towards smaller models for cheaper, faster, and greener NLP by Manuel Tonneau [Blog, 2020]
Compression of Deep Learning Models for Text: A Survey (+Overview of Approaches) [Paper, April 2021]

Automated Summarization

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 62 stars]

Rule-based NLP

LemmInflect - A Python module for English lemmatization and inflection

Best Practices for NLP

In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
EMNLP 2020: High Performance Natural Language Processing by Google Research [Slides, Recording, Nov. 2020]
Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
How to Structure and Manage NLP Projects [Blog, May 2021]
Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]

Transformer-based Architectures

Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 97 stars]
Practical NLP for the Real World [Presentation, 2019]
From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]

Embeddings as a Service

embedding-as-service [GitHub, 156 stars]
Bert-as-service [GitHub, 9384 stars]

NLP Recipes Industrial Applications:

NLP Recipes by microsoft [GitHub, 5590 stars]
NLP with Python by susanli2016 [GitHub, 2064 stars]
Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 1929 stars]

NLP Applications in Bio, Finance, Legal and other industries

Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 492 stars]
Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 954 stars]
FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 147 stars]
LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 458 stars]
NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 294 stars]
BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 155 stars]

Model and Data testing

WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 65 stars]
Great Expectations - Write tests for your data [GitHub, 4653 stars]
CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1414 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1538 stars]

General Speech Recognition

wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5798 stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 17664 stars]
Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub, 10595 stars]
awesome-kaldi - resources for using Kaldi [GitHub, 429 stars]
ESPnet - End-to-End Speech Processing Toolkit [GitHub, 3918 stars]
HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech

FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 630 stars]
TTS - a deep learning toolkit for Text-to-Speech [GitHub, 1883 stars]

Blogs

Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]

Frameworks for Topic Modeling

gensim - framework for topic modeling [GitHub, 12224 stars]
Spark NLP [GitHub, 2234 stars]

Repositories

Top2Vec [GitHub, 1176 stars]
Anchored Correlation Explanation Topic Modeling [GitHub, 273 stars]
Topic Modeling in Embedding Spaces [GitHub, 352 stars] [Paper]
TopicNet - A high-level interface for BigARTM library [GitHub, 108 stars]
BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 1227 stars]
OCTIS - A Python package to optimize and evaluate topic models [GitHub, 196 stars]
Contextualized Topic Models [GitHub, 475 stars]

Text Rank

PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1565 stars]
textrank - TextRank implementation for Python 3 [GitHub, 1037 stars]
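
As a rough illustration of the idea behind TextRank keyword extraction, the sketch below runs a PageRank-style iteration over a word co-occurrence graph. It is a simplified assumption-laden example, not the code of the packages listed above: the function name `textrank_keywords` and its parameters are hypothetical, and the part-of-speech filtering and phrase merging that TextRank normally applies are omitted.

```python
def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens by a PageRank-style score on their co-occurrence graph."""
    # Undirected graph: two distinct words are linked when they appear
    # within `window` tokens of each other (window=2 means adjacent words).
    neighbors = {token: set() for token in tokens}
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != token:
                neighbors[token].add(other)
                neighbors[other].add(token)
    if not neighbors:
        return []

    n = len(neighbors)
    scores = {token: 1.0 / n for token in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of the score of every neighbor.
        scores = {
            token: (1 - damping) / n
            + damping * sum(scores[o] / len(neighbors[o]) for o in neighbors[token])
            for token in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    text = "graph ranking uses a graph of words and ranking of words"
    for word, score in textrank_keywords(text.split())[:3]:
        print(f"{word}: {score:.3f}")
```

In practice the input would be lowercased, stripped of stop words, and restricted to nouns and adjectives before ranking.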

RAKE - Rapid Automatic Keyword Extraction

rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 831 stars]
yake - Single-document unsupervised keyword extraction [GitHub, 704 stars]
RAKE-tutorial - A Python implementation of the Rapid Automatic Keyword Extraction [GitHub, 352 stars]

Other Approaches

flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 4837 stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 190 stars]
keyBERT - Minimal keyword extraction with BERT [GitHub, 683 stars]

NLP and ML Interpretability

Language Interpretability Tool (LIT) [GitHub, 2589 stars]
WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 274 stars]
Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 257 stars]
InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 3924 stars]
ecco - Tools to visualize and explore NLP language models [GitHub, 815 stars]
NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub, 196 stars]
transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 405 stars]

Ethics, Bias, and Equality in NLP

Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
Computational Ethics for NLP - course resources from the Carnegie Mellon University [Lecture Notes, Spring 2020]
Ethics in NLP - resources from ACL's Ethics in NLP track
The Institute for Ethical AI & Machine Learning
Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]

Adversarial Attacks for NLP

Privacy Considerations in Large Language Models [Blog, Dec 2020]
DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 48 stars]

General Purpose

spaCy by Explosion AI [GitHub, 20858 stars]
flair by Zalando [GitHub, 10550 stars]
AllenNLP by AI2 [GitHub, 10311 stars]
stanza (formerly Stanford NLP) [GitHub, 5508 stars]
spaCy stanza [GitHub, 540 stars]
nltk [GitHub, 9978 stars]
gensim - framework for topic modeling [GitHub, 12224 stars]
pororo - Platform of neural models for natural language processing [GitHub, 941 stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2697 stars]
FARM [GitHub, 1263 stars]
gobbli by RTI International [GitHub, 256 stars]
headliner - training and deployment of seq2seq models [GitHub, 229 stars]
SyferText - A privacy preserving NLP framework [GitHub, 178 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1135 stars]
TextHero - Text preprocessing, representation and visualization [GitHub, 2239 stars]
textblob - TextBlob: Simplified Text Processing [GitHub, 7725 stars]
AdaptNLP - A high level framework and library for NLP [GitHub, 326 stars]
textacy - NLP, before and after spaCy [GitHub, 1703 stars]
texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2177 stars]
jiant - jiant is an NLP toolkit [GitHub, 1282 stars]

Data Augmentation

Tools

WildNLP - Text manipulation library to test NLP models [GitHub, 65 stars]
snorkel - Framework to generate training data [GitHub, 4692 stars]
NLPAug - Data augmentation for NLP [GitHub, 2193 stars]
SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 323 stars]
faker - Python package that generates fake data for you [GitHub, 12753 stars]
textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 470 stars]
Parrot - Practical and feature-rich paraphrasing framework [GitHub, 244 stars]

Papers & Blogs

A Survey of Data Augmentation Approaches for NLP [Paper, May 2021]

Adversarial NLP Attacks

TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1538 stars]
CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5184 stars]

Non-English oriented

textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 86 stars]
Kashgari - Transfer Learning with focus on Chinese [GitHub, 2141 stars]
Underthesea - Vietnamese NLP Toolkit [GitHub, 863 stars]

Transformer-oriented

transformers by HuggingFace [GitHub, 48261 stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 470 stars]
haystack - Transformers at scale for question answering & neural search. [GitHub, 2067 stars]

Dialog Systems and Speech

DeepPavlov by MIPT [GitHub, 5287 stars]
ParlAI by FAIR [GitHub, 7300 stars]
rasa - Framework for Conversational Agents [GitHub, 11670 stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5798 stars]
ChatterBot - conversational dialog engine for creating chat bots [GitHub, 11280 stars]

Word/Sentence-embeddings oriented

MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2812 stars]
vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 553 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 5498 stars]

Multi-lingual tools

polyglot - Multi-lingual NLP Framework [GitHub, 1864 stars]
trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 488 stars]

Distributed NLP

Spark NLP [GitHub, 2234 stars]

Machine Translation

COMET - A Neural Framework for MT Evaluation [GitHub, 65 stars]
marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 808 stars]
argos-translate - Open source neural machine translation in Python [GitHub, 646 stars]
Opus-MT - Open neural machine translation models and web services [GitHub, 139 stars]
dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 152 stars]

Entity and String Matching

PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 356 stars]
pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 619 stars]
fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8292 stars]
jellyfish - approximate and phonetic matching of strings [GitHub, 1489 stars]
textdistance - Compute distance between sequences [GitHub, 1997 stars]
DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 347 stars]
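
To make "fuzzy string matching" concrete, here is a minimal sketch of Levenshtein edit distance with a normalized similarity score. It uses only the standard library and does not reflect the API of any package listed above; the names `levenshtein` and `similarity` are chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning `a` into `b`."""
    # `previous[j]` holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"), similarity("kitten", "sitting"))
```

Dedicated libraries typically offer faster implementations plus token-based and phonetic variants that this sketch does not cover.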

Discourse Analysis

ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 278 stars]

PII scrubbing

scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 240 stars]

General

Learn NLP the practical way [Blog, Nov. 2019]
Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]

Books

Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
Top NLP Books to Read 2020 - Blog post by Raymond Cheng [Blog, Sep 2020]

Courses

NLP Course | For You - Great and interactive course on NLP
OpenClass NLP - Natural language processing (NLP) assignments
Choosing the right course for a Practical NLP Engineer
12 Best Natural Language Processing Courses & Tutorials to Learn Online

Tutorials

nlp-tutorial - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub, 1233 stars]
nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 9011 stars]
Hands-On NLTK Tutorial [GitHub, 437 stars]
Modern Practical Natural Language Processing [GitHub, 256 stars]

r/LanguageTechnology - NLP Reddit forum

Tokenization

tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 4680 stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 5191 stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 91 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks

WildNLP - Text manipulation library to test NLP models [GitHub, 65 stars]
snorkel - Framework to generate training data [GitHub, 4692 stars]
NLPAug - Data augmentation for NLP [GitHub, 2193 stars]
SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 323 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1538 stars]

Blogs and Tutorials

A Visual Survey of Data Augmentation in NLP [Blog, 2020]
Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]

Named Entity Recognition (NER)

Datasets for Entity Recognition [GitHub, 970 stars]
Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 229 stars]
Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 102 stars]
Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 126 stars]

Relation Extraction

tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 286 stars]
tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 42 stars]
tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 59 stars]

Coreference Resolution

NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2327 stars]
coref - BERT and SpanBERT for Coreference Resolution [GitHub, 303 stars]

Domain Adaptation

Neural Adaptation in Natural Language Processing - curated list [GitHub, 161 stars]

Low Resource NLP

CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 492 stars]

Spell Correction

Gramformer - Framework for detecting, highlighting and correcting grammatical errors [GitHub, 588 stars]
NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 211 stars]
SymSpellPy - Python port of SymSpell [GitHub, 448 stars]
Speller100 by Microsoft [Blog, Feb 2021]

Style Transfer for NLP

Styleformer - Neural Language Style Transfer framework [GitHub, 189 stars]

Automata Theory for NLP

pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 619 stars]

Obscene words detection

LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1443 stars]

Reinforcement Learning for NLP

nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 91 stars]

AutoML / AutoNLP

AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 591 stars]
TPOT - Python Automated Machine Learning tool [GitHub, 8109 stars]
Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1263 stars]
HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 625 stars]
AutoML Natural Language - Google's paid AutoML NLP service

Text Generation

keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 160 stars]
Controllable Neural Text Generation [Blog, Jan 2021]

License CC0

Attributions

Resources

All linked resources belong to original authors

Icons

Akropolis by parkjisun from the Noun Project
Book of Ester by Gilad Sotil from the Noun Project
quill by Juan Pablo Bravo from the Noun Project
acting by Flatart from the Noun Project
olympic by supalerk laipawat from the Noun Project
aristocracy by Eucalyp from the Noun Project
Horn by Eucalyp from the Noun Project
temple by Eucalyp from the Noun Project
constellation by Eucalyp from the Noun Project
ancient greek round pattern by Olena Panasovska from the Noun Project
Harp by Vectors Point from the Noun Project
Atlas by parkjisun from the Noun Project
Parthenon by Eucalyp from the Noun Project
papyrus by IconMark from the Noun Project
papyrus by Smalllike from the Noun Project
pegasus by Saeful Muslim from the Noun Project

Fonts

Dalek Font
