A comprehensive reference for all topics related to Natural Language Processing

Creative Commons Zero v1.0 Universal (CC0-1.0)

The-NLP-Pandect

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

The-NLP-Resources

Compendiums and awesome lists on the topic of NLP:

NLP Conferences, Paper Summaries and Paper Compendiums:

NLP Progress and NLP Tasks:

NLP Datasets:

Word and Sentence embeddings:

Notebooks, Scripts and Repositories

Non-English resources and compendiums

Pre-trained NLP models

The-NLP-Podcasts

The-NLP-Newsletter

The-NLP-Meetups

The-NLP-Youtube

The-NLP-Benchmarks

  • SQuAD - Stanford Question Answering Dataset (SQuAD)
  • GLUE - General Language Understanding Evaluation (GLUE) benchmark
  • SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
  • XTREME - Massively Multilingual Multi-task Benchmark
  • decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
  • RACE - ReAding Comprehension dataset collected from English Examinations

The-NLP-Research

General

Embeddings

Repositories

Blogs

Byte Pair Encoding

  • bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub ~800 stars]
  • subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub ~1500 stars]
  • python-bpe - Byte Pair Encoding for Python [GitHub ~100 stars]

Transformer-based Architectures

General

Transformer

BERT

T5

GPT-family

General

GPT-3

BigBird

Other

Distillation, Pruning and Quantization

Automated Summarization

The-NLP-Industry

Transformer-based Architectures

Embeddings as a Service

NLP Recipes Industrial Applications:

NLP Applications in Bio, Finance, Legal and other industries

The-NLP-Speech

General Speech Recognition

  • wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]
  • DeepSpeech - Baidu's DeepSpeech architecture [GitHub ~14k stars]
  • Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
  • kaldi - Kaldi is a toolkit for speech recognition [GitHub ~9k stars]
  • awesome-kaldi - resources for using Kaldi [GitHub ~300 stars]

Text to Speech

  • FastSpeech - An implementation of FastSpeech based on PyTorch [GitHub ~500 stars]

The-NLP-Topics

Blogs

Frameworks for Topic Modeling

  • gensim - framework for topic modeling [GitHub ~11k stars]
  • Spark NLP [GitHub ~1k stars]

Repositories

The-NLP-Frameworks

General Purpose

  • spaCy by Explosion AI [GitHub ~17k stars]
  • flair by Zalando [GitHub ~9k stars]
  • AllenNLP by AI2 [GitHub ~9k stars]
  • stanza (formerly Stanford NLP) [GitHub ~4k stars]
  • spaCy stanza [GitHub ~400 stars]
  • nltk [GitHub ~9k stars]
  • gensim - framework for topic modeling [GitHub ~11k stars]
  • NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub ~2.5k stars]
  • polyglot - Multi-lingual NLP Framework [GitHub ~2k stars]
  • FARM [GitHub ~1k stars]
  • gobbli by RTI International [GitHub ~200 stars]
  • headliner - training and deployment of seq2seq models [GitHub ~200 stars]
  • SyferText - A privacy preserving NLP framework [GitHub ~100 stars]
  • DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub ~600 stars]
  • TextHero - Text preprocessing, representation and visualization [GitHub ~2k stars]
  • textblob - TextBlob: Simplified Text Processing [GitHub ~7k stars]
  • AdaptNLP - A high level framework and library for NLP [GitHub ~200 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub ~800 stars]

Non-English oriented

  • textblob-de - TextBlob: Simplified Text Processing for German [GitHub ~100 stars]
  • Kashgari - Transfer learning with a focus on Chinese [GitHub ~2k stars]
  • Underthesea - Vietnamese NLP Toolkit [GitHub ~800 stars]

Transformer-oriented

Dialog Systems and Speech

  • DeepPavlov by MIPT [GitHub ~4k stars]
  • ParlAI by FAIR [GitHub ~6k stars]
  • rasa - Framework for Conversational Agents [GitHub ~9k stars]
  • wav2letter - Automatic Speech Recognition Toolkit [GitHub ~5k stars]

Distributed NLP

Other NLP Topics

General

Tokenization

  • tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub ~3k stars]
  • SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub ~4k stars]
  • SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub ~100 stars]

Data Augmentation and Weak Supervision

NLP Interpretability

Ethics, Bias, and Equality in NLP

The-NLP-Learning

Books

Courses

Tutorials

The-NLP-Communities

License CC0

Attributions

Resources

  • All linked resources belong to original authors

Icons

Fonts