awesome-nlp

A curated list of resources dedicated to Natural Language Processing

Maintainers - Keon Kim, Martin Park

Contributing

Please feel free to open pull requests or email Keon Kim (keon.kim@nyu.edu) to add links.

Tutorials and Courses
- videos
Codes
- Implementations
- Libraries
  - Node.js
  - Python
  - C++
  - Java
  - Clojure
  - Ruby
Articles
Blogs
Multilingual
- Spanish
Credits

Tutorials and Courses

TensorFlow Tutorial on Seq2Seq Models
Natural Language Understanding with Distributed Representation Lecture Note by Cho

videos

Stanford's Coursera Course on NLP from basics
Intro to Natural Language Processing on Coursera by U of Michigan
Intro to Artificial Intelligence course on Udacity which also covers NLP
Deep Learning for Natural Language Processing by Richard Socher

Codes

Implementations

Pre-trained word embeddings for WSJ corpus by Koc AI-Lab
Word2vec by Mikolov
HLBL language model by Turian
Real-valued vector "embeddings" by Dhillon
Improving Word Representations Via Global Context And Multiple Word Prototypes by Huang
Dependency based word embeddings
Global Vectors for Word Representations

Libraries

TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
Node.js and JavaScript - Node.js Libraries for NLP
- Twitter-text - A JavaScript implementation of Twitter's text processing library
- Knwl.js - A Natural Language Processor in JS
- Retext - Extensible system for analyzing and manipulating natural language
- NLP Compromise - Natural Language processing in the browser
- Natural - general natural language facilities for node
Python - Python NLP Libraries
- Scikit-learn: Machine learning in Python
- Natural Language Toolkit (NLTK)
- Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.
- TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
- YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.
- jieba - Chinese Words Segmentation Utilities.
- SnowNLP - A library for processing Chinese text.
- KoNLPy - A Python package for Korean natural language processing.
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
- BLLIP Parser - Python bindings for the BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
- PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.
- python-ucto - Python binding to ucto (a unicode-aware rule-based tokenizer for various languages)
- python-frog - Python binding to Frog, an NLP suite for Dutch (PoS tagging, lemmatisation, dependency parsing, NER)
- python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English.
- colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
- spaCy - Industrial strength NLP with Python and Cython.
- PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies.
C++ - C++ Libraries
- MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction
- CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
- CRFsuite - CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data.
- BLLIP Parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
- colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
- ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
- libfolia - C++ library for the FoLiA format
- frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
- MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data.
- Mecab (Japanese)
- Mecab (Korean)
Java - Java NLP Libraries
- Stanford NLP
- OpenNLP
- ClearNLP
- Word2vec in Java
- ReVerb Web-Scale Open Information Extraction
- OpenRegex - An efficient and flexible token-based regular expression language and engine.
Clojure
- Clojure-openNLP - Natural Language Processing in Clojure (opennlp)
- Inflections-clj - Rails-like inflection library for Clojure and ClojureScript
Ruby
- Kevin Dias's collection of Natural Language Processing (NLP) Ruby libraries, tools and software

Articles

Review Articles

Word Vectors

word2vec - on creating vectors to represent language, useful for RNN inputs
sense2vec - on word sense disambiguation
Infinite Dimensional Word Embeddings - new
Skip Thought Vectors - sentence representation method
Adaptive skip-gram - similar approach, with adaptive properties
