Data loaders and abstractions for text and NLP

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

torchtext

This repository consists of:

  • torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
  • torchtext.datasets: Pre-built loaders for common NLP datasets
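
As a rough illustration of what a vocabulary abstraction does, the sketch below maps tokens to integer indices in plain Python. It is not torchtext's implementation: the names build_vocab and numericalize, the special tokens, and the ordering rule are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokens seen at least min_freq times.

    Hypothetical helper for illustration; not part of torchtext.
    """
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials)
    # Order by descending frequency, then alphabetically, so indices are stable.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, unk="<unk>"):
    """Convert tokens to indices, sending unseen tokens to the unk index."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
itos, stoi = build_vocab(corpus)
ids = numericalize(["the", "bird", "sat"], stoi)
```

Here "bird" never appears in the corpus, so it falls back to the index of the unknown token.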

Installation

Make sure you have Python 2.7 or 3.5+ and PyTorch 0.2.0 or newer. You can then install torchtext using pip:

pip install torchtext
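
If you want to verify the Python requirement before installing, a small check along these lines may help. The function name python_supported is hypothetical, and the sketch only covers the interpreter version stated above, not the PyTorch version.

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True for Python 2.7 or Python 3.5 and newer."""
    major, minor = version_info[0], version_info[1]
    return (major, minor) == (2, 7) or (major, minor) >= (3, 5)


if not python_supported():
    print("This Python version does not meet the stated requirement.")
```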

Optional requirements

If you want to use the English tokenizer from spaCy, you need to install spaCy and download its English model:

pip install spacy
python -m spacy download en

Alternatively, you might want to use the Moses tokenizer from NLTK. In that case, install NLTK and download the data it needs:

pip install nltk
python -m nltk.downloader perluniprops nonbreaking_prefixes
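
Since both tokenizers are optional, it can be useful to check which packages are present before choosing one. The sketch below uses only the standard library and assumes Python 3; the helper names available_tokenizers and basic_tokenize are hypothetical, and the whitespace fallback is an assumption for this example rather than a description of torchtext's behavior.

```python
import importlib.util


def available_tokenizers():
    """Return the optional tokenizer packages that can be imported."""
    return [name for name in ("spacy", "nltk")
            if importlib.util.find_spec(name) is not None]


def basic_tokenize(text):
    """Whitespace tokenizer used here as a fallback (an assumption)."""
    return text.split()


found = available_tokenizers()
if not found:
    # Neither optional package is installed, so fall back to splitting.
    tokens = basic_tokenize("A simple  sentence .")
```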

Data

The data module provides the following:

Datasets

The datasets module currently contains:

  • Sentiment analysis: SST and IMDb
  • Question classification: TREC
  • Entailment: SNLI
  • Language modeling: abstract class + WikiText-2
  • Machine translation: abstract class + Multi30k, IWSLT, WMT14
  • Sequence tagging (e.g. POS/NER): abstract class + UDPOS

Others are planned or in progress:

  • Question answering: SQuAD

See the test directory for examples of dataset usage.