PyTorch PoS Tagging

This repo contains tutorials covering how to do part-of-speech (PoS) tagging using PyTorch 1.4 and TorchText 0.5 using Python 3.7.

These tutorials will cover getting started with the de facto approach to PoS tagging: recurrent neural networks (RNNs). The first introduces a bi-directional LSTM (BiLSTM) network. The second covers how to fine-tune a pretrained Transformer model.

If you find any mistakes or disagree with any of the explanations, please do not hesitate to submit an issue. I welcome any feedback, positive or negative!

Getting Started

To install PyTorch, see installation instructions on the PyTorch website.

To install TorchText:

pip install torchtext

To install the transformers library:

pip install transformers

We'll also make use of spaCy to tokenize our data. To install spaCy, follow the instructions here making sure to install the English models:

python -m spacy download en

Tutorials

1 - BiLSTM for PoS Tagging

This tutorial covers the workflow of a PoS tagging project with PyTorch and TorchText. We'll introduce the basic TorchText concepts such as: defining how data is processed; using TorchText's datasets and how to use pre-trained embeddings. Using PyTorch we built a strong baseline model: a multi-layer bi-directional LSTM. We also show how the model can be used for inference to tag any input text.
2 - Fine-tuning Pretrained Transformers for PoS Tagging

This tutorial covers how to fine-tune a pretrained Transformer model, provided by the transformers library, by integrating it with TorchText. We use a pretrained BERT model to provide the embeddings for our input text and input these embeddings to a linear layer that will predict tags based on these embeddings.

References

Here are some things I looked at while making these tutorials. Some of it may be out of date.

xiexinran/pytorch-pos-tagging

PyTorch PoS Tagging

Getting Started

Tutorials

References