TorchLanguage is the equivalent of TorchVision for Natural Language Processing. It gives you access to text transformers (tokens, index, n-grams, etc) and data sets.
Join our community to create datasets and deep-learning models! Chat with us on Gitter and join the Google Group to collaborate with us.
This repository consists of:
- torchlanguage.datasets : Pre-built datasets for common NLP tasks
- torchlanguage.models : Generic pretrained models for common NLP tasks
- torchlanguage.transforms : Common transformation for text
- torchlanguage.utils : Tools, functions and measures for NLP
Installation
Make sure you have Python 2.7 or 3.5+ and PyTorch 0.2.0 or newer. You can then install torchlanguage using pip::
pip install TorchLanguage
Optional requirements
If you want to use English tokenizer from SpaCy <http://spacy.io/>
_, you need to install SpaCy and download its English model::
pip install spacy
python -m spacy download en
Text transformation pipeline
The following transformation are available :
- Character
- Character2Gram
- Character3Gram
- Compose
- DropOut
- Embedding
- FunctionWord
- GensimModel
- GloveVector
- HorizontalStack
- MaxIndex
- PartOfSpeech
- RandomSamples
- RemoveCharacter
- RemoveLines
- RemoveRegex
- Tag
- ToFrequencyVector
- ToIndex
- Token
- ToLength
- ToLower
- ToNGram
- ToOneHot
- ToUpper
- Transformer
- VerticalStack
Data
The data module provides the following:
- Ability to download and load a corpus from a directory. The file must be name Class_Title.txt:
dataset = torchlanguage.datasets.FileDirectory(
root='./data',
download=True,
download_url="http://urltozip/file.zip",
transform=transformer
)
- Wrapper for dataset splits (train, validation) and cross-validation:
cross_val_dataset = {'train': torchlanguage.utils.CrossValidation(dataset, k=k),
'test': torchlanguage.utils.CrossValidation(dataset, k=k, train=False)}
for k in range(k):
for data in cross_val_dataset['train']:
inputs, label = data
# end for
for data in cross_val_dataset['test']:
inputs, label = data
# end for
cross_val_dataset['train'].next_fold()
cross_val_dataset['test'].next_fold()
# end for
Datasets
The datasets module currently contains:
- FileDirectory: Load a corpus from a directory
- ReutersC50Dataset: The Reuters C50 dataset for authorship attribution
- SFGram: A set of science-fiction magazine with five authors.
Others are planned or a work in progress:
- Traduction
- Question answering
See the examples
directory for examples of dataset usage.
Related Work
echotorch
EchoTorch is a Python framework to easily implement Reservoir Computing models with pyTorch.
Authors
- Nils Schaetti — Developer
Citing
If you find TorchLanguage useful for an academic publication, then please use the following BibTeX to cite it:
@misc{torchlanguage,
author = {Schaetti, Nils},
title = {TorchLanguage: Natural Language Processing with pyTorch},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/nschaetti/TorchLanguage}},
}