  • SQuAD leaderboard. A list of the strongest-performing NLP models on the Stanford Question Answering Dataset (SQuAD).
    • SQuAD 1.0 paper (Last updated October 2016). SQuAD v1.1 includes over 100,000 question and answer pairs based on Wikipedia articles.
    • SQuAD 2.0 paper (October 2018). The second generation of SQuAD adds unanswerable questions, which the NLP model must recognize as unanswerable from the given passage.
  • GLUE leaderboard.
    • GLUE paper (September 2018). A collection of nine NLP tasks including single-sentence tasks (e.g. check if grammar is correct, sentiment analysis), similarity and paraphrase tasks (e.g. determine if two questions are equivalent), and inference tasks (e.g. determine whether a premise contradicts a hypothesis).

Online courses






APIs and Libraries

  • R packages
    • tm: Text Mining.
    • lsa: Latent Semantic Analysis.
    • lda: Collapsed Gibbs Sampling Methods for Topic Models.
    • textir: Inverse Regression for Text Analysis.
    • corpora: Statistics and data sets for corpus frequency data.
    • tau: Text Analysis Utilities.
    • tidytext: Text mining using dplyr, ggplot2, and other tidy tools.
    • Sentiment140: Sentiment text analysis.
    • sentimentr: Lexicon-based sentiment analysis.
    • cleanNLP: A tidy data model for natural language processing (tokenization, part-of-speech tagging, and related annotation).
    • RSentiment: Lexicon-based sentiment analysis. Contains support for negation detection and sarcasm.
    • text2vec: Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities.
    • fastTextR: Interface to the fastText library.
    • LDAvis: Interactive visualization of topic models.
    • keras: Interface to Keras, a high-level neural networks 'API'. (RStudio Blog: TensorFlow for R)
    • rtweet: Client for accessing Twitter’s REST and stream APIs. (21 Recipes for Mining Twitter Data with rtweet)
    • topicmodels: Interface to the C code for Latent Dirichlet Allocation (LDA).
    • textmineR: Aid for text mining in R, with a syntax that should be familiar to experienced R users.
    • wordVectors: Creating and exploring word2vec and other word embedding models.
    • gtrendsR: Interface for retrieving and displaying the information returned online by Google Trends.
    • textstem: Tools that stem and lemmatize text.
    • NLPutils: Utilities for Natural Language Processing.
    • Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing using UDPipe.
  • Python modules
    • NLTK: Natural Language Toolkit.
    • scikit-learn: Machine Learning in Python.
    • spaCy: Industrial-Strength Natural Language Processing in Python.
    • textblob: Simplified Text processing.
    • Gensim: Topic Modeling for humans.
    • Pattern.en: A fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface.
    • textmining: Python Text Mining utilities.
    • Scrapy: Open source and collaborative framework for extracting the data you need from websites.
    • lda2vec: Tools for interpreting natural language.
    • PyText: A deep-learning-based NLP modeling framework built on PyTorch.
    • sent2vec: General purpose unsupervised sentence representations.
    • flair: A very simple framework for state-of-the-art Natural Language Processing (NLP).
    • word_forms: Accurately generate all possible forms of an English word, e.g. "election" --> "elect", "electoral", "electorate", etc.
    • AllenNLP: Open-source NLP research library, built on PyTorch.
    • Beautiful Soup: Parse HTML and XML documents. Useful for web scraping.
    • BigARTM: Fast topic modeling platform.
    • Scattertext: Beautiful visualizations of how language differs among document types.
    • embeddings: Pretrained word embeddings in Python.
    • fastText: Library for efficient learning of word representations and sentence classification.
    • Google Seq2Seq: A general-purpose encoder-decoder framework for TensorFlow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.
    • polyglot: A natural language pipeline that supports multilingual applications.
    • textacy: NLP, before and after spaCy.
    • Glove-Python: A “toy” implementation of GloVe in Python. Includes a paragraph embedder.
    • Bert As A Service: Client/server package for sentence encoding, i.e. mapping a variable-length sentence to a fixed-length vector. Designed to provide a scalable, production-ready service while also letting researchers apply BERT quickly.
    • Keras-BERT: A Keras implementation of BERT.
    • Paragraph embedding scripts and Pre-trained models: Scripts for training and testing paragraph vectors, with links to some pre-trained Doc2Vec and Word2Vec models.
    • Texthero: Text preprocessing, representation and visualization from zero to hero.
  • Apache Tika: A content analysis toolkit.
  • Apache Spark: A fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
    • MLlib: MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. Related to NLP there are methods available for LDA, Word2Vec, and TFIDF.
    • LDA: latent Dirichlet allocation
    • Word2Vec: An Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of the vectors of all words in the document.
    • TFIDF: term frequency-inverse document frequency
  • HDF5: an open source file format that supports large, complex, heterogeneous data. Requires no configuration.
    • h5py: Python HDF5 package
  • Stanford CoreNLP: a suite of core NLP tools
  • Stanford Parser: A probabilistic natural language parser.
  • Stanford POS Tagger: A Parts-of-Speech tagger.
  • Stanford Named Entity Recognizer: Recognizes proper nouns (things, places, organizations) and labels them as such.
  • Stanford Classifier: A softmax classifier.
  • Stanford OpenIE: Extracts relationships between words in a sentence (e.g. Mark Zuckerberg; founded; Facebook).
  • Stanford Topic Modeling Toolbox
  • MALLET: MAchine Learning for LanguagE Toolkit
  • Apache OpenNLP: Machine learning based toolkit for text NLP.
  • Streamcrab: Real-time Twitter sentiment analyzer engine. http://www.streamcrab.com
  • TextRazor API: Extract Meaning from your Text.
  • fastText. Library for fast text representation and classification. Facebook.
  • Comparison of Top 6 Python NLP Libraries.
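
The Word2Vec entry under MLlib above describes turning a document into a single vector by averaging its word vectors. The idea can be sketched in plain Python as follows. This is an illustration only, not Spark's API: the function name `average_word_vectors`, the toy two-dimensional vectors, and the choice to skip out-of-vocabulary words are all assumptions made for the example.

```python
from typing import Dict, List


def average_word_vectors(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Return the element-wise mean of the vectors of the known tokens.

    Tokens missing from word_vectors are skipped (an assumption of this
    sketch). If no token is known, a zero vector of length dim is returned.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "spark": [1.0, 0.0],
    "text": [0.0, 1.0],
    "mining": [1.0, 1.0],
}

# "unknown" is not in the vocabulary, so only three vectors are averaged.
doc_vector = average_word_vectors(
    ["spark", "text", "mining", "unknown"], toy_vectors, dim=2
)
empty_vector = average_word_vectors(["unknown"], toy_vectors, dim=2)
```

Averaging discards word order, so two documents with the same words in a different sequence get the same vector. That trade-off is why this representation tends to be used as a simple baseline feature for classification or similarity.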



Getting Data out of PDFs

Online Demos and Tools


Lexicons for Sentiment Analysis



Other Curated Lists


