This repo contains tutorials covering how to do part-of-speech (PoS) tagging using PyTorch 1.4 and TorchText 0.5 with Python 3.7.
These tutorials will cover getting started with the de facto approach to PoS tagging: recurrent neural networks (RNNs). The first introduces a bi-directional LSTM (BiLSTM) network. The second covers how to fine-tune a pretrained Transformer model.
If you find any mistakes or disagree with any of the explanations, please do not hesitate to submit an issue. I welcome any feedback, positive or negative!
To install PyTorch, see installation instructions on the PyTorch website.
To install TorchText:
pip install torchtext
To install the transformers library:
pip install transformers
We'll also make use of spaCy to tokenize our data. To install spaCy, follow the instructions on the spaCy website, making sure to install the English models:
python -m spacy download en
1 - BiLSTM for PoS Tagging
This tutorial covers the workflow of a PoS tagging project with PyTorch and TorchText. We introduce the basic TorchText concepts: defining how data is processed, using TorchText's datasets, and using pre-trained embeddings. Using PyTorch, we build a strong baseline model: a multi-layer bi-directional LSTM. We also show how the model can be used for inference to tag any input text.
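As a rough idea of what the baseline looks like, here is a minimal sketch of a multi-layer bi-directional LSTM tagger followed by a toy inference step. It is not the tutorial's exact code: the class name `BiLSTMTagger`, the hyperparameters, and the tiny `VOCAB` and `TAGS` tables are all made up for illustration, and the model is untrained, so the predicted tags are arbitrary.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Embedding -> multi-layer BiLSTM -> linear layer over tags (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags,
                 num_layers=2, dropout=0.25, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            bidirectional=True,
            # nn.LSTM only applies dropout between stacked layers
            dropout=dropout if num_layers > 1 else 0.0,
        )
        # Forward and backward hidden states are concatenated, hence hidden_dim * 2
        self.fc = nn.Linear(hidden_dim * 2, num_tags)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tokens):
        # tokens: [seq_len, batch_size]
        embedded = self.dropout(self.embedding(tokens))
        outputs, _ = self.lstm(embedded)  # [seq_len, batch_size, hidden_dim * 2]
        return self.fc(self.dropout(outputs))  # [seq_len, batch_size, num_tags]


# Hypothetical toy vocabulary and tag set, purely for illustration
VOCAB = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4}
TAGS = ["<pad>", "DET", "NOUN", "VERB"]

model = BiLSTMTagger(vocab_size=len(VOCAB), embed_dim=16, hidden_dim=32,
                     num_tags=len(TAGS))
model.eval()  # disable dropout for inference

# Inference: map tokens to indices, run the model, take the arg-max tag per token
sentence = "the cat sat".split()
ids = torch.tensor([VOCAB.get(tok, VOCAB["<unk>"]) for tok in sentence])
ids = ids.unsqueeze(1)  # add a batch dimension -> [seq_len, 1]

with torch.no_grad():
    logits = model(ids)

predicted = [TAGS[i] for i in logits.argmax(dim=-1).squeeze(1).tolist()]
print(list(zip(sentence, predicted)))
```

In the tutorial itself the vocabulary and tag set come from TorchText rather than hand-written dictionaries, and the embedding layer is initialised from pre-trained vectors before training.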
2 - Fine-tuning Pretrained Transformers for PoS Tagging
This tutorial covers how to fine-tune a pretrained Transformer model, provided by the transformers library, by integrating it with TorchText. We use a pretrained BERT model to provide the embeddings for our input text and feed these embeddings to a linear layer that predicts a tag for each token.
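The BERT-plus-linear-layer idea can be sketched roughly as follows. This is an illustration under assumptions, not the tutorial's code: the class name `BertTagger` and all sizes are hypothetical, and to keep the sketch runnable without downloading weights it builds a tiny, randomly initialised `BertModel` from a `BertConfig`. For actual fine-tuning you would load pretrained weights instead, for example with `BertModel.from_pretrained('bert-base-uncased')`.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel


class BertTagger(nn.Module):
    """BERT encoder followed by a per-token linear layer (illustrative sketch)."""

    def __init__(self, bert, num_tags, dropout=0.25):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        # One set of tag logits per token, computed from BERT's hidden state
        self.fc = nn.Linear(bert.config.hidden_size, num_tags)

    def forward(self, input_ids):
        # input_ids: [batch_size, seq_len]
        hidden = self.bert(input_ids)[0]  # last hidden states: [batch, seq_len, hidden_size]
        return self.fc(self.dropout(hidden))  # [batch, seq_len, num_tags]


# Tiny random-weight BERT so the sketch runs offline (assumption for illustration only)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
bert = BertModel(config)

tagger = BertTagger(bert, num_tags=4)
tagger.eval()

# Fake batch of 2 sequences with 6 token ids each
input_ids = torch.randint(0, 100, (2, 6))
with torch.no_grad():
    bert_logits = tagger(input_ids)

print(bert_logits.shape)
```

Because the whole model is a single `nn.Module`, the optimizer updates the BERT weights together with the linear layer, which is what fine-tuning means here.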
Here are some things I looked at while making these tutorials. Some of them may be out of date.