A curated list of resources dedicated to Natural Language Processing
Maintainers - Keon Kim, Martin Park
Please feel free to send pull requests or email Keon Kim (keon.kim@nyu.edu) to add links.
- TensorFlow Tutorial on Seq2Seq Models
- Natural Language Understanding with Distributed Representation Lecture Note by Cho
- Stanford's Coursera Course on NLP from basics
- Intro to Natural Language Processing on Coursera by U of Michigan
- Intro to Artificial Intelligence course on Udacity which also covers NLP
- Deep Learning for Natural Language Processing by Richard Socher
- Pre-trained word embeddings for WSJ corpus by Koc AI-Lab
- Word2vec by Mikolov
- HLBL language model by Turian
- Real-valued vector "embeddings" by Dhillon
- Improving Word Representations Via Global Context And Multiple Word Prototypes by Huang
- Dependency based word embeddings
- Global Vectors for Word Representation (GloVe)
- TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
- Node.js and Javascript - Node.js Libraries for NLP
- Twitter-text - A JavaScript implementation of Twitter's text processing library
- Knwl.js - A Natural Language Processor in JS
- Retext - Extensible system for analyzing and manipulating natural language
- NLP Compromise - Natural Language processing in the browser
- Natural - general natural language facilities for node
- Python - Python Libraries for NLP
- Scikit-learn: Machine learning in Python
- Natural Language Toolkit (NLTK)
- Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.
- TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
- YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.
- jieba - Chinese word segmentation utilities.
- SnowNLP - A library for processing Chinese text.
- KoNLPy - A Python package for Korean natural language processing.
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
- BLLIP Parser - Python bindings for the BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
- PyNLPl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.
- python-ucto - Python binding to ucto (a unicode-aware rule-based tokenizer for various languages)
- python-frog - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatisation, dependency parsing, NER).
- python-zpar - Python bindings for ZPar, a statistical part-of-speech tagger, constituency parser, and dependency parser for English.
- colibri-core - Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
- spaCy - Industrial strength NLP with Python and Cython.
- PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies.
- C++ - C++ Libraries for NLP
- MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation extraction
- CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
- CRFsuite - An implementation of Conditional Random Fields (CRFs) for labeling sequential data.
- BLLIP Parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
- colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
- ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
- libfolia - C++ library for the FoLiA format
- frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
- MeTA - ModErn Text Analysis, a C++ data sciences toolkit that facilitates mining big text data.
- Mecab (Japanese)
- Mecab (Korean)
- Java - Java Libraries for NLP
- Stanford NLP
- OpenNLP
- ClearNLP
- Word2vec in Java
- ReVerb - Web-Scale Open Information Extraction
- OpenRegex - An efficient and flexible token-based regular expression language and engine.
- Clojure - Clojure Libraries for NLP
- Clojure-openNLP - Natural Language Processing in Clojure (opennlp)
- Inflections-clj - Rails-like inflection library for Clojure and ClojureScript
- Deep Learning for Web Search and Natural Language Processing
- Probabilistic topic models
- Natural language processing: an introduction
- A unified architecture for natural language processing: Deep neural networks with multitask learning
- A Critical Review of Recurrent Neural Networks for Sequence Learning
- Deep parsing in Watson
- Online named entity recognition method for microtexts in social networking services: A case study of twitter
- word2vec - on creating vectors to represent language, useful for RNN inputs
- sense2vec - on word sense disambiguation
- Infinite Dimensional Word Embeddings - new
- Skip Thought Vectors - sentence representation method
- Adaptive skip-gram - similar approach, with adaptive properties
- Neural autoencoder for paragraphs and documents - LSTM representation
- LSTM over tree structures
- Sequence to Sequence Learning - word vectors for machine translation
- Teaching Machines to Read and Comprehend - DeepMind paper
- Efficient Estimation of Word Representations in Vector Space
- Improving distributional similarity with lessons learned from word embeddings
- Low-Dimensional Embeddings of Logic
- Tutorial on Markov Logic Networks (based on this paper)
- Markov Logic Networks for Natural Language Question Answering
- Distant Supervision for Cancer Pathway Extraction From Text
- Privee: An Architecture for Automatically Analyzing Web Privacy Policies
- A Neural Probabilistic Language Model
- Template-Based Information Extraction without the Templates
- Retrofitting word vectors to semantic lexicons
- Unsupervised Learning of the Morphology of a Natural Language
- Natural Language Processing (Almost) from Scratch
- Computational Grounded Cognition: a new alliance between grounded cognition and computational modelling
- Learning the Structure of Biomedical Relation Extractions
- Relation extraction with matrix factorization and universal schemas
- A survey of named entity recognition and classification
- Benchmarking the extraction and disambiguation of named entities on the semantic web
- Knowledge base population: Successful approaches and challenges
- SpeedRead: A fast named entity recognition Pipeline
- Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning
- Generating Chinese Named Entity Data from a Parallel Corpus
- IXA pipeline: Efficient and Ready to Use Multilingual NLP tools
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Statistical Language Models based on Neural Networks
- Slides from Google Talk
- Word2Vec
- Relation Extraction with Matrix Factorization and Universal Schemas
- Towards a Formal Distributional Semantics: Simulating Logical Calculi with Tensors
- Presentation slides for MLN tutorial
- Presentation slides for QA applications of MLNs
- Presentation slides
- Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
- Blog Post on Deep Learning, NLP, and Representations
- Blog Post on NLP Tutorial
- Natural Language Processing Blog by Hal Daumé III
- Machine Learning Blog by Brian McFee
- POS TAGGERS
- NER
- ETC
part of the lists are from