This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.
- The NLP Index by Quantum Stat / NLP Cypher
- Awesome NLP by keon [GitHub, 13102 stars]
- Speech and Natural Language Processing Awesome List by edobashira [GitHub, 2085 stars]
- Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1042 stars]
- Text Mining and Natural Language Processing Resources by stepthom [GitHub, 476 stars]
- Made with ML List by madewithml.com
- Brainsources for #NLP enthusiasts by Philip Vollet
- Awesome AI/ML/DL - NLP Section [GitHub, 1040 stars]
- Resources on various machine learning topics by Backprop
- 100 Must-Read NLP Papers [GitHub, 3327 stars]
- NLP Paper Summaries by dair-ai [GitHub, 1408 stars]
- Curated collection of papers for the NLP practitioner [GitHub, 1060 stars]
- Papers on Textual Adversarial Attack and Defense [GitHub, 1072 stars]
- The Most Influential NLP Research of 2019
- Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 289 stars]
- Some Notable Recent ML Papers and Future Trends by Aran Komatsuzaki [Blog, Oct. 2020]
- A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1547 stars]
- A Paper List for Style Transfer in Text [GitHub, 1332 stars]
- Video recordings index for papers
- NLP top 10 conferences Compendium by soulbliss [GitHub, 416 stars]
- NLP Conferences Calendar
- ICLR 2020 Trends
- SpacyIRL 2019 Conference in Overview
- Paper Digest - Conferences and Papers in Overview
- Video Recordings from Conferences
- NLP Progress by sebastianruder [GitHub, 20004 stars]
- NLP Tasks by Kyubyong [GitHub, 2979 stars]
- Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 432 stars]
- Awesome Sentiment Analysis by xiamx [GitHub, 864 stars]
- NLP Datasets by niderhoff [GitHub, 5034 stars]
- Datasets by Huggingface [GitHub, 13099 stars]
- Big Bad NLP Database
- 25 Best Parallel Text Datasets for Machine Translation Training
- UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
- 20 Best German Language Datasets for Machine Learning
- Awesome Embedding Models by Hironsan [GitHub, 1500 stars]
- Awesome list of Sentence Embeddings by Separius [GitHub, 2023 stars]
- Awesome BERT by Jiakui [GitHub, 1770 stars]
- The Super Duper NLP Repo [Website, 2020]
- NLP Resources for Bahasa Indonesian [GitHub, 272 stars]
- Indic NLP Catalog [GitHub, 338 stars]
- Pre-trained language models for Vietnamese [GitHub, 434 stars]
- Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 754 stars]
- Indic NLP Library [GitHub, 416 stars]
- AI4Bharat-IndicNLP Portal
- ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 225 stars]
- zemberek-nlp - NLP tools for Turkish [GitHub, 984 stars]
- KLUE - Korean Language Understanding Evaluation [GitHub, 427 stars]
- Persian NLP Benchmark - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 62 stars]
- List of pre-trained NLP models [GitHub, 157 stars]
- Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 2238 stars]
- Spanish Language Models and resources [GitHub, 189 stars]
- History of Natural Language Processing
- A Review of the Neural History of Natural Language Processing [Blog, October 2018]
- Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
- ML and NLP Research Highlights of 2020 [Blog, January 2021]
- NLP Highlights [Years: 2017 - now, Status: active]
- The NLP Zone Episodes [Years: 2021 - now, Status: active]
- TWIML AI [Years: 2016 - now, Status: active]
- Practical AI [Years: 2018 - now, Status: active]
- The Data Exchange [Years: 2019 - now, Status: active]
- Gradient Dissent [Years: 2020 - now, Status: active]
- Machine Learning Street Talk [Years: 2020 - now, Status: active]
- DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]
- The Super Data Science Podcast [Years: 2016 - now, Status: active]
- Data Hack Radio [Years: 2018 - now, Status: active]
- AI Game Changers [Years: 2020 - now, Status: active]
- The Analytics Show [Years: 2019 - now, Status: active]
- NLP News by Sebastian Ruder
- dair.ai Newsletter by dair.ai
- This Week in NLP by Robert Dale
- Papers with Code
- The Batch by deeplearning.ai
- Paper Digest by PaperDigest
- NLP Cypher by QuantumStat
- NLP Zurich [YouTube Recordings]
- NY-NLP (New York)
- Online NLP Meetup
- Hacking-Machine-Learning [YouTube Recordings]
- Yannic Kilcher
- HuggingFace
- Kaggle Reading Group
- Rasa Paper Reading
- Stanford CS224N: NLP with Deep Learning
- NLPxing
- ML Explained - A.I. Socratic Circles - AISC
- Deeplearning.ai
- Machine Learning Street Talk
- GLUE - General Language Understanding Evaluation (GLUE) benchmark
- SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
- decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- RACE - ReAding Comprehension dataset collected from English Examinations
- dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
- DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
- WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
- WikiLingua - A Multilingual Abstractive Summarization Dataset
- SQuAD - Stanford Question Answering Dataset (SQuAD)
- XQuAD - Cross-lingual Question Answering Dataset (XQuAD) for cross-lingual question answering
- GrailQA - Strongly Generalizable Question Answering (GrailQA)
- CSQA - Complex Sequential Question Answering
- XTREME - Massively Multilingual Multi-task Benchmark
- GLUECoS - A benchmark for code-switched NLP
- IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
- IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- LinCE - Linguistic Code-Switching Evaluation Benchmark
- Russian SuperGlue - Russian SuperGlue Benchmark
- BLURB - Biomedical Language Understanding and Reasoning Benchmark
- BLUE - Biomedical Language Understanding Evaluation benchmark
- LexGLUE - A Benchmark Dataset for Legal Language Understanding in English
- Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 421 stars]
- SUPERB - Speech processing Universal PERformance Benchmark
- CodeXGLUE - A benchmark dataset for code intelligence
- CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
- MultiNLI - Multi-Genre Natural Language Inference corpus
- iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic
- A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
- Pre-trained ELMo Representations for Many Languages [GitHub, 1385 stars]
- sense2vec - Contextually-keyed word vectors [GitHub, 1366 stars]
- wikipedia2vec [GitHub, 767 stars]
- StarSpace [GitHub, 3747 stars]
- fastText [GitHub, 23558 stars]
- Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- The Illustrated Word2vec by Jay Alammar [Blog, 2019]
- vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 583 stars]
- sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 7543 stars]
- bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1055 stars]
- subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1892 stars]
- python-bpe - Byte Pair Encoding for Python [GitHub, 171 stars]
- The Transformer Family by Lilian Weng [Blog, 2020]
- Keeping up with the BERTs: a review of the main NLP benchmarks by Manuel Tonneau [Blog, 2020]
- Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- Attention? Attention! by Lilian Weng [Blog, 2018]
- the transformer … “explained”? [Blog, 2019]
- Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- Understanding and Applying Self-Attention for NLP [Talk, 2018]
- The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- A Survey of Transformers [Paper, June 2021]
- The Annotated Transformer by Harvard NLP [Blog, 2018]
- The Illustrated Transformer by Jay Alammar [Blog, 2018]
- Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
- Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- Reformer: The Efficient Transformer [Blog, 2020]
- Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- TRANSFORMERS FROM SCRATCH [Blog, 2019]
- Universal Transformers by Mostafa Dehghani [Blog, 2019]
- Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
- Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 526 stars]
- Transformers from Scratch [Blog, Oct 2021]
- A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- Understanding searches better than ever before [Blog, 2019]
- Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 258 stars]
- BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 431 stars]
- Optimal Subarchitecture Extraction for BERT [GitHub, 453 stars]
- CharacterBERT: Reconciling ELMo and BERT [GitHub, 143 stars]
- When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
- BERT-related Papers - a list of BERT-related papers [GitHub, 1853 stars]
- T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- T5: the Text-To-Text Transfer Transformer [Blog, 2020]
- multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 833 stars]
- Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]
- Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 817 stars]
- Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]
- The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- The Annotated GPT-2 by Aman Arora
- OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- How to generate text by Patrick von Platen [Blog, 2020]
- Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
- GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
- GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- Is it possible for language models to achieve language understanding? by Christopher Potts
- Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3495 stars]
- GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
- OpenAI API - API Demo to use GPT-3 for commercial applications
- GPT-Neo - in-progress GPT-3 open source replication HuggingFace Hub
- GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
- Effectively using GPT-J with few-shot learning [Blog, July 2021]
- What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
- Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- Turing NLG by Microsoft
- Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
- ELECTRA [GitHub, 2007 stars]
- Performer - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 817 stars]
- Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
- David over Goliath: towards smaller models for cheaper, faster, and greener NLP by Manuel Tonneau [Blog, 2020]
- Compression of Deep Learning Models for Text: A Survey (+Overview of Approaches) [Paper, April 2021]
- Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 55 stars]
- XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 101 stars]
- PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 103 stars]
- XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 145 stars]
- SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 188 stars]
- PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 65 stars]
- summarus - Models for automatic abstractive summarization [GitHub, 138 stars]
- Fusing Knowledge into Language Model [Presentation, Oct 2021]
- In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- EMNLP 2020: High Performance Natural Language Processing by Google Research [Slides, Recording, Nov. 2020]
- Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
- How to Structure and Manage NLP Projects [Blog, May 2021]
- Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
- Introduction to NLP for Industry Use - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place:
- Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
- Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- Model Registry - make sure any models you train are versioned and tracked, so that it is easy to roll back to any of them
- Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as Blue/Green or Canary deploys
- Data and Model Observability - track data drift, model accuracy drift etc.
Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:
- Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
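
As one illustration of the behavioral testing practice listed among the processes above, here is a minimal sketch of an invariance test: the prediction should not change when a person's name in the input is swapped. Everything in it is an assumption made for illustration. `toy_sentiment` is a hypothetical lexicon-based stand-in for a real model, and `swap_name` and `invariance_failures` are hypothetical helpers, not the API of any library listed in this pandect.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tiny lexicons, used only to keep the example self-contained.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real model: counts lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def swap_name(text: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of one name with another."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


def invariance_failures(
    predict: Callable[[str], str],
    cases: List[str],
    perturb: Callable[[str], str],
) -> List[Tuple[str, str, str, str]]:
    """Return every case whose prediction changes after the perturbation."""
    failures = []
    for text in cases:
        perturbed = perturb(text)
        before, after = predict(text), predict(perturbed)
        if before != after:
            failures.append((text, perturbed, before, after))
    return failures


cases = [
    "John thinks the service was great.",
    "John had a terrible experience.",
    "John arrived at noon.",
]
failures = invariance_failures(
    toy_sentiment, cases, lambda t: swap_name(t, "John", "Aisha")
)
print(f"{len(failures)} invariance failure(s) out of {len(cases)} cases")
```

In a real pipeline, `predict` would wrap the deployed model and a check like this would run in CI alongside the regular unit and integration tests, failing the build when the list of failures is not empty.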
- MLOps: What It Is, Why it Matters, and How To Implement It by Neptune AI [Blog, July 2021]
- Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- Robust MLOps - Robust MLOps with Open-Source: ModelDB, Docker, Jenkins and Prometheus [Blog, May 2021]
- State of MLOps 2021 by Valohai [Blog, August 2021]
- The MLOps Stack by Valohai [Blog, October 2020]
- Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
- The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- MLOps: Comprehensive Beginner’s Guide [Blog, March 2021]
- What I’ve learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
- DataRobot Challenger Models - MLOps Champion/Challenger Models
- State of MLOps Blog by Dr. Ori Cohen
- MLOps course by Made With ML
- GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub
- The MLOps Community - blogs, slack group, newsletter and more all about MLOps
- DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
- Optuna - hyperparameter optimization framework [GitHub, 6274 stars]
- Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
- Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5535 stars]
- DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1422 stars]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Valohai - End-to-end ML pipelines [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1638 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1940 stars]
- WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 72 stars]
- Great Expectations - Write tests for your data [GitHub, 6459 stars]
- Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 1378 stars]
- mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- Amazon SageMaker [Paid Service]
- Valohai - End-to-end ML pipelines [Paid Service]
- NLP Cloud - Production-ready NLP API [Paid Service]
- Saturn Cloud [Paid Service]
- SELDON - machine learning deployment for enterprise [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 2560 stars]
- Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
- KFServing - Serverless Inferencing on Kubernetes [GitHub, 1455 stars]
- TFX - TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines [Free and Open Source]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- Cortex - containers as a service on AWS [Paid Service]
- Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
- End2End Serverless Transformers On AWS Lambda [GitHub, 100 stars]
- NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- Dagster - data orchestrator for machine learning [Free and Open Source]
- Verta - AI and machine learning deployment and operations [Paid Service]
- Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5535 stars]
- flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 2180 stars]
- MLRun - Machine Learning automation and tracking [GitHub, 618 stars]
- DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 727 stars]
- Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 391 stars]
- WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 685 stars]
- whylogs - open source standard for data and ML logging [GitHub, 959 stars]
- Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 1014 stars]
- MLRun - Machine Learning automation and tracking [GitHub, 618 stars]
- DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- Cortex - containers as a service on AWS [Paid Service]
- Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
- Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- Fiddler - ML Model Performance Management Tool [Paid Service]
- Hydrosphere - open-source platform for managing ML models [Paid Service]
- Verta - AI and machine learning deployment and operations [Paid Service]
- Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
- iguazio - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]
- Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
- datakin - end-to-end, real-time data lineage solution [Paid Service]
- Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- SODA - data monitoring, testing and validation [Paid Service]
- whatify - data quality and action recommendation on it [Paid Service]
- Tecton - enterprise feature store for machine learning [Paid Service]
- FEAST - open source feature store for machine learning Website [GitHub, 3111 stars]
- Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]
- ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 463 stars]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5535 stars]
- kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 7119 stars]
- Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 3091 stars]
- ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 1905 stars]
- Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
- Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 764 stars]
- Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
- Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 159 stars]
- Practical NLP for the Real World [Presentation, 2019]
- From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 457 stars]
- Training BERT with Compute/Time (Academic) Budget [GitHub, 217 stars]
- embedding-as-service [GitHub, 169 stars]
- Bert-as-service [GitHub, 10211 stars]
- NLP Recipes by microsoft [GitHub, 5890 stars]
- NLP with Python by susanli2016 [GitHub, 2289 stars]
- Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2050 stars]
- Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 537 stars]
- Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1141 stars]
- FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 160 stars]
- LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 513 stars]
- NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 360 stars]
- BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 198 stars]
- wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6029 stars]
- DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 19419 stars]
- Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- kaldi - Kaldi is a toolkit for speech recognition [GitHub, 11503 stars]
- awesome-kaldi - resources for using Kaldi [GitHub, 488 stars]
- ESPnet - End-to-End Speech Processing Toolkit [GitHub, 4952 stars]
- HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]
- FastSpeech - An implementation of FastSpeech based on PyTorch [GitHub, 693 stars]
- TTS - a deep learning toolkit for Text-to-Speech [GitHub, 4598 stars]
- VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 349 stars]
- Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
- A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]
- Top2Vec [GitHub, 1619 stars]
- Anchored Correlation Explanation Topic Modeling [GitHub, 279 stars]
- Topic Modeling in Embedding Spaces [GitHub, 435 stars] Paper
- TopicNet - A high-level interface for BigARTM library [GitHub, 121 stars]
- BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 2246 stars]
- OCTIS - A python package to optimize and evaluate topic models [GitHub, 355 stars]
- Contextualized Topic Models [GitHub, 842 stars]
- GSDMM - GSDMM: Short text clustering [GitHub, 269 stars]
- PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1783 stars]
- textrank - TextRank implementation for Python 3 [GitHub, 1103 stars]
- rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 905 stars]
- yake - Single-document unsupervised keyword extraction [GitHub, 1026 stars]
- RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 361 stars]
- flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5149 stars]
- BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 206 stars]
- keyBERT - Minimal keyword extraction with BERT [GitHub, 1403 stars]
- Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
- How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]
- Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
- ecco - Tools to visualize and explore NLP language models [GitHub, 1380 stars]
- NLP Profiler - A simple NLP library for profiling datasets with text columns [GitHub, 218 stars]
- transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 611 stars]
- Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 613 stars]
- LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 891 stars]
- Language Interpretability Tool (LIT) [GitHub, 2903 stars]
- WhatLies - Toolkit to help visualise what lies in word embeddings [GitHub, 348 stars]
- Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 313 stars]
- InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 4682 stars]
- thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 97 stars]
- Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 215 stars]
- imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 727 stars]
- Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
- Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
- Computational Ethics for NLP - course resources from the Carnegie Mellon University [Lecture Notes, Spring 2020]
- Ethics in NLP - resources from ACL's Ethics in NLP track
- The Institute for Ethical AI & Machine Learning
- Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
- Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 12 stars]
- nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 36 stars]
- Privacy Considerations in Large Language Models [Blog, Dec 2020]
- DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 61 stars]
- Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 49 stars]
- HateXplain - BERT for detecting abusive language [GitHub, 104 stars]
- spaCy by Explosion AI [GitHub, 23197 stars]
- flair by Zalando [GitHub, 11515 stars]
- AllenNLP by AI2 [GitHub, 10940 stars]
- stanza (formerly StanfordNLP) [GitHub, 6087 stars]
- spaCy stanza [GitHub, 627 stars]
- nltk [GitHub, 10671 stars]
- gensim - framework for topic modeling [GitHub, 13121 stars]
- pororo - Platform of neural models for natural language processing [GitHub, 1104 stars]
- NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2814 stars]
- FARM [GitHub, 1506 stars]
- gobbli by RTI International [GitHub, 267 stars]
- headliner - training and deployment of seq2seq models [GitHub, 231 stars]
- SyferText - A privacy preserving NLP framework [GitHub, 185 stars]
- DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1213 stars]
- TextHero - Text preprocessing, representation and visualization [GitHub, 2490 stars]
- textblob - TextBlob: Simplified Text Processing [GitHub, 8126 stars]
- AdaptNLP - A high level framework and library for NLP [GitHub, 394 stars]
- textacy - NLP, before and after spaCy [GitHub, 1916 stars]
- texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2272 stars]
- jiant - jiant is an NLP toolkit [GitHub, 1397 stars]
- WildNLP - Text manipulation library to test NLP models [GitHub, 72 stars]
- snorkel - Framework to generate training data [GitHub, 5114 stars]
- NLPAug - Data augmentation for NLP [GitHub, 3170 stars]
- SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 354 stars]
- faker - Python package that generates fake data for you [GitHub, 14052 stars]
- textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 551 stars]
- Parrot - Practical and feature-rich paraphrasing framework [GitHub, 427 stars]
- AugLy - data augmentations library for audio, image, text, and video [GitHub, 4421 stars]
- TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 232 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1940 stars]
- CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5476 stars]
- CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1638 stars]
- transformers by HuggingFace [GitHub, 61482 stars]
- Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 782 stars]
- haystack - Transformers at scale for question answering & neural search. [GitHub, 4540 stars]
- DeepPavlov by MIPT [GitHub, 5685 stars]
- ParlAI by FAIR [GitHub, 8778 stars]
- rasa - Framework for Conversational Agents [GitHub, 13867 stars]
- wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6029 stars]
- ChatterBot - conversational dialog engine for creating chat bots [GitHub, 12181 stars]
- SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 3946 stars]
- MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2945 stars]
- vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 583 stars]
- sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 7543 stars]
- Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 548 stars]
- DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 138 stars]
- LemmInflect - python module for English lemmatization and inflection [GitHub, 153 stars]
- Inflect - generate plurals, ordinals, indefinite articles [GitHub, 627 stars]
- simplemma - simple multilingual lemmatizer for Python [GitHub, 627 stars]
- polyglot - Multi-lingual NLP Framework [GitHub, 1989 stars]
- trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 593 stars]
- Spark NLP [GitHub, 2690 stars]
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 457 stars]
- COMET - A Neural Framework for MT Evaluation [GitHub, 140 stars]
- marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 891 stars]
- argos-translate - Open source neural machine translation in Python [GitHub, 1144 stars]
- Opus-MT - Open neural machine translation models and web services [GitHub, 200 stars]
- dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 202 stars]
- PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 451 stars]
- pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 701 stars]
- fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8654 stars]
- jellyfish - approximate and phonetic matching of strings [GitHub, 1648 stars]
- textdistance - Compute distance between sequences [GitHub, 2792 stars]
- DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 415 stars]
- RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 324 stars]
- Machamp - A Generalized Entity Matching Benchmark [GitHub, 6 stars]
- ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 347 stars]
- scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 285 stars]
- hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 38 stars]
- booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 547 stars]
- bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 70 stars]
- fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 215 stars]
- SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 290 stars]
- Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 172 stars]
- jProcessing - Japanese Natural Language Processing Libraries [GitHub, 142 stars]
- Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 565 stars]
- kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 808 stars]
- nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 295 stars]
- KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 186 stars]
- Jigg - Pipeline framework for easy natural language processing [GitHub, 69 stars]
- Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 283 stars]
- RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 442 stars]
- toiro - a comparison tool of Japanese tokenizers [GitHub, 101 stars]
- textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 93 stars]
- Kashgari - Transfer Learning with focus on Chinese [GitHub, 2294 stars]
- Underthesea - Vietnamese NLP Toolkit [GitHub, 959 stars]
- PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 53 stars]
- Small-Text - Active Learning for Text Classification in Python [GitHub, 232 stars]
- Doccano - open source annotation tool for machine learning practitioners [GitHub, 6043 stars]
- Prodigy - annotation tool powered by active learning [Paid Service]
- Learn NLP the practical way [Blog, Nov. 2019]
- Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
- Choosing the right course for a Practical NLP Engineer
- 12 Best Natural Language Processing Courses & Tutorials to Learn Online
- Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 272 stars]
- NLP Course | For You - Great and interactive course on NLP
- OpenClass NLP - Natural language processing (NLP) assignments
- Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
- Transformer models for NLP by HuggingFace
- Stanford NLP Seminar - slides from the Stanford NLP course
- Applied Language Technology - Natural Language Processing for Linguists
- Natural Language Processing with Transformers [Book, February 2022]
- Applied Natural Language Processing in the Enterprise [Book, May 2021]
- Practical Natural Language Processing [Book, June 2020]
- Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
- Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
- Top NLP Books to Read 2020 - Blog post by Raymond Cheng [Blog, Sep 2020]
- nlp-tutorial - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub, 1308 stars]
- nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 10510 stars]
- Hands-On NLTK Tutorial [GitHub, 477 stars]
- Modern Practical Natural Language Processing [GitHub, 259 stars]
- Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 1305 stars]
- r/LanguageTechnology - NLP Reddit forum
- tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 5537 stars]
- SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 5824 stars]
- SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 103 stars]
- WildNLP - Text manipulation library to test NLP models [GitHub, 72 stars]
- NLPAug - Data augmentation for NLP [GitHub, 3170 stars]
- SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 354 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1940 stars]
- skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 687 stars]
- NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 597 stars]
- EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1232 stars]
- snorkel - Framework to generate training data [GitHub, 5114 stars]
- A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
- A Visual Survey of Data Augmentation in NLP [Blog, 2020]
- Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]
- Datasets for Entity Recognition [GitHub, 1140 stars]
- Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 276 stars]
- Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 172 stars]
- Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 267 stars]
- tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 320 stars]
- tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 52 stars]
- tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 62 stars]
- Re-TACRED - Addressing Shortcomings of the TACRED Dataset [GitHub, 34 stars]
- NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2525 stars]
- coref - BERT and SpanBERT for Coreference Resolution [GitHub, 368 stars]
- Neural Adaptation in Natural Language Processing - curated list [GitHub, 217 stars]
- CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 534 stars]
- Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1065 stars]
- NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 430 stars]
- SymSpellPy - Python port of SymSpell [GitHub, 559 stars]
- Speller100 by Microsoft [Blog, Feb 2021]
- JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 479 stars]
- Styleformer - Neural Language Style Transfer framework [GitHub, 354 stars]
- StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 37 stars]
- pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 701 stars]
- LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1754 stars]
- Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 478 stars]
- SkillNER - rule based NLP module to extract job skills from text [GitHub, 36 stars]
- nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 121 stars]
- AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 683 stars]
- TPOT - Python Automated Machine Learning tool [GitHub, 8534 stars]
- Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1620 stars]
- HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 659 stars]
- AutoML Natural Language - Google's paid AutoML NLP service
- Optuna - hyperparameter optimization framework [GitHub, 6274 stars]
- FLAML - fast and lightweight AutoML library [GitHub, 1831 stars]
- Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 275 stars]
- keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 257 stars]
- Controllable Neural Text Generation [Blog, Jan 2021]
- BARTScore - Evaluating Generated Text as Text Generation [GitHub, 113 stars]
- TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 60 stars]
License CC0
- All linked resources belong to original authors
- Akropolis by parkjisun from the Noun Project
- Book of Ester by Gilad Sotil from the Noun Project
- quill by Juan Pablo Bravo from the Noun Project
- acting by Flatart from the Noun Project
- olympic by supalerk laipawat from the Noun Project
- aristocracy by Eucalyp from the Noun Project
- Horn by Eucalyp from the Noun Project
- temple by Eucalyp from the Noun Project
- constellation by Eucalyp from the Noun Project
- ancient greek round pattern by Olena Panasovska from the Noun Project
- Harp by Vectors Point from the Noun Project
- Atlas by parkjisun from the Noun Project
- Parthenon by Eucalyp from the Noun Project
- papyrus by IconMark from the Noun Project
- papyrus by Smalllike from the Noun Project
- pegasus by Saeful Muslim from the Noun Project