This project implements a Part-of-Speech (PoS) Tagger using a Bidirectional Long Short-Term Memory (Bi-LSTM) model. The model is trained on a combination of datasets provided by NLTK and uses the universal tagset for labeling.
A Part-of-Speech (PoS) tagger is a tool used in natural language processing (NLP) to label each word in a sentence with its corresponding part of speech, such as noun, verb, or adjective. This process helps in understanding the syntactic structure of the sentence and the role each word plays within it.
Consider the sentence: "The quick brown fox jumps over the lazy dog." The tagger produces output like:

`[('the', 'det'), ('quick', 'adj'), ..., ('lazy', 'adj'), ('dog', 'noun')]`
Each word is tagged according to its part of speech, which helps in understanding how the words relate to each other. This tagging is crucial for various NLP tasks like parsing, text-to-speech systems, and information extraction.
The following datasets were combined for training the model:
- Treebank
- Brown
- CoNLL-2000 (`conll2000` in NLTK)
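As a minimal sketch, the three corpora can be loaded and merged through NLTK's built-in mapping to the universal tagset; the download calls and the `tagged_sentences` name here are illustrative, and the project's actual shuffling and split logic is not shown.

```python
import nltk
from nltk.corpus import treebank, brown, conll2000

# The corpora and the tag-mapping tables ship with NLTK but need a one-time download.
for resource in ("treebank", "brown", "conll2000", "universal_tagset"):
    nltk.download(resource, quiet=True)

# tagset="universal" maps each corpus's native tags onto the universal tagset,
# so the three datasets share one label space and can be concatenated.
tagged_sentences = (
    list(treebank.tagged_sents(tagset="universal"))
    + list(brown.tagged_sents(tagset="universal"))
    + list(conll2000.tagged_sents(tagset="universal"))
)
print(len(tagged_sentences))    # total sentences in the combined pool
print(tagged_sentences[0][:3])  # e.g. [('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.')]
```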
The PoS tags used in this project follow the NLTK `universal_tagset`, which includes:
- ADJ - Adjective
- ADP - Adposition
- ADV - Adverb
- CONJ - Conjunction
- DET - Determiner
- NOUN - Noun
- NUM - Numeral
- PRON - Pronoun
- PRT - Particle
- VERB - Verb
- . - Punctuation
- X - Other (residual elements)
The project follows these steps to train the PoS tagger:
1. **Data Preparation:**
   - Combine datasets from NLTK.
   - Split the data into training, validation, and test sets.
2. **Tokenization:** (steps 2-4 are illustrated in the sketch after this list)
   - Create tokenizers for sentences (input `x`) and tags (output `y`).
   - Automatically generate vocabulary while creating tokenizers.
3. **Sequence Conversion:**
   - Convert sentences into sequences of tokens.
   - Pad sequences to ensure uniform input size.
4. **One-Hot Encoding:**
   - Convert tag sequences to one-hot encoding since they are categorical.
5. **Model Training:**
   - Train the Bi-LSTM model using the processed data.
6. **Testing and Inference:**
   - Test the model on the test dataset.
   - Perform inference on new sentences to predict PoS tags (see the inference sketch at the end of this section).
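As referenced above, the following is a minimal sketch of steps 2-4 using the Keras preprocessing utilities; the toy `train_sentences`/`train_tags` lists are placeholders standing in for the combined corpus, and `MAX_SEN_LEN = 161` is taken from the example shapes quoted in the architecture description below.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Toy placeholders for the combined corpus; in the real pipeline these come
# from the NLTK sentences loaded above, split into train/val/test sets.
train_sentences = [["the", "dog", "barks"], ["a", "cat", "sleeps"]]
train_tags = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]]

MAX_SEN_LEN = 161  # padded length, taken from the example shapes below

# Step 2: fitting a tokenizer builds the vocabulary automatically.
# Tokenizer lowercases by default, which is why the example output above
# shows lowercase tags like 'det' and 'noun'.
word_tokenizer = Tokenizer(oov_token="<OOV>")
word_tokenizer.fit_on_texts(train_sentences)
tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(train_tags)

# Step 3: convert words and tags to integer sequences, then pad to MAX_SEN_LEN.
x = pad_sequences(word_tokenizer.texts_to_sequences(train_sentences),
                  maxlen=MAX_SEN_LEN, padding="post")
y = pad_sequences(tag_tokenizer.texts_to_sequences(train_tags),
                  maxlen=MAX_SEN_LEN, padding="post")

# Step 4: tags are categorical, so one-hot encode them.
num_classes = len(tag_tokenizer.word_index) + 1  # +1 for the padding index 0
y = to_categorical(y, num_classes=num_classes)
print(x.shape, y.shape)  # (2, 161) (2, 161, 4) with the toy data
```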
The model architecture consists of the following layers (a Keras sketch follows this list):

- **Embedding Layer:**
  - Converts input tokens into dense vector representations of a specified dimension (`embedding_dim`), capturing the semantic meaning of words.
  - Input Shape: `(Batch size, MAX_SEN_LEN)` - e.g., `(256, 161)`
  - Output Shape: `(Batch size, MAX_SEN_LEN, embedding_dim)` - e.g., `(256, 161, 128)`
- **Bidirectional LSTM Layer:**
  - Processes the input sequences in both forward and backward directions to capture dependencies from both ends of the sentence.
  - Input Shape: `(Batch size, MAX_SEN_LEN, embedding_dim)` - e.g., `(256, 161, 128)`
  - Output Shape: `(Batch size, MAX_SEN_LEN, 2 * lstm_units)` - e.g., `(256, 161, 256)`
- **TimeDistributed Dense Layer:**
  - The `TimeDistributed` Dense layer applies a fully connected layer to each time step independently, producing a probability distribution over the possible PoS tags (`num_classes`) for each token.
  - Input Shape: `(Batch size, MAX_SEN_LEN, 2 * lstm_units)` - e.g., `(256, 161, 256)`
  - Output Shape: `(Batch size, MAX_SEN_LEN, num_classes)` - e.g., `(256, 161, 10)`
- **Compilation:**
  - The model is compiled with the `categorical_crossentropy` loss function, which suits the multi-class classification problem of PoS tagging.
  - The `adam` optimizer is used for training, and the model's performance is evaluated using accuracy.
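Putting the layers together, a minimal Keras sketch of this architecture could look as follows; `vocab_size`, `embedding_dim`, and `lstm_units` are assumptions inferred from the example shapes above, not necessarily the project's exact hyperparameters.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     TimeDistributed, Dense)

MAX_SEN_LEN = 161    # padded sentence length from the example shapes
vocab_size = 10000   # placeholder; in practice len(word_tokenizer.word_index) + 1
embedding_dim = 128  # matches the (256, 161, 128) example output shape
lstm_units = 128     # Bidirectional doubles this to 256 features per step
num_classes = 10     # matches the (256, 161, 10) example output shape

model = Sequential([
    # (batch, MAX_SEN_LEN) -> (batch, MAX_SEN_LEN, embedding_dim)
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # -> (batch, MAX_SEN_LEN, 2 * lstm_units); forward and backward passes concatenated
    Bidirectional(LSTM(lstm_units, return_sequences=True)),
    # -> (batch, MAX_SEN_LEN, num_classes); softmax over the tags at each time step
    TimeDistributed(Dense(num_classes, activation="softmax")),
])
model.build(input_shape=(None, MAX_SEN_LEN))

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()

# Hypothetical training call; batch size 256 matches the example shapes above.
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=256, epochs=5)
```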
This architecture allows the model to effectively capture contextual information in sentences, making it suitable for the PoS tagging task.
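To illustrate the inference step referenced earlier, a hypothetical helper could map a raw sentence to tags by reusing the tokenizers and model from the sketches above; `predict_tags` is an illustrative name, not part of the project's actual API.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_tags(sentence):
    """Tag a raw sentence with the trained model (hypothetical helper)."""
    words = sentence.split()
    seq = pad_sequences(word_tokenizer.texts_to_sequences([words]),
                        maxlen=MAX_SEN_LEN, padding="post")
    probs = model.predict(seq, verbose=0)[0]      # (MAX_SEN_LEN, num_classes)
    tag_ids = probs.argmax(axis=-1)[:len(words)]  # drop the padded positions
    idx_to_tag = {i: t for t, i in tag_tokenizer.word_index.items()}
    return [(w, idx_to_tag.get(int(i), "x")) for w, i in zip(words, tag_ids)]

print(predict_tags("The quick brown fox jumps over the lazy dog"))
# e.g. [('The', 'det'), ('quick', 'adj'), ..., ('lazy', 'adj'), ('dog', 'noun')]
```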