POS-Tagging

Parts-of-Speech Tagging using Hidden Markov Model and Viterbi Algorithm

Introduction

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech tag (noun, verb, adjective, etc.) to each word in an input text. In other words, the objective is to identify the grammatical category to which each word in the given text belongs. POS tagging is difficult because some words are ambiguous: the same word can represent different parts of speech depending on its context. Consider the following examples:

The whole team played well. (adverb)

You are doing well for yourself. (adjective)

Well, this is a lot of work. (interjection)

The well is dry. (noun)

Tears were beginning to well in her eyes. (verb)

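In each of these sentences, the same word, well, takes a different part of speech. To resolve this ambiguity, the project uses a Hidden Markov Model (HMM), a probabilistic model in which the tags are hidden states and the words are observations, together with the Viterbi algorithm, which finds the most probable tag sequence for a sentence.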
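
To illustrate how Viterbi decoding over an HMM can work, here is a minimal sketch. It is not taken from this repository's components/viterbi.cpp: the `Hmm` struct, the `viterbi` function, and the use of integer word and tag ids are all assumptions made for illustration, and the probabilities are assumed to be supplied in log space.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical container for HMM parameters in log space (not the repo's types).
struct Hmm {
    std::vector<double> logStart;               // logStart[t]    = log P(tag t at position 0)
    std::vector<std::vector<double>> logTrans;  // logTrans[p][t] = log P(tag t | previous tag p)
    std::vector<std::vector<double>> logEmit;   // logEmit[t][w]  = log P(word w | tag t)
};

// Returns the most probable tag sequence for the given word ids.
std::vector<int> viterbi(const Hmm& hmm, const std::vector<int>& words) {
    assert(!words.empty());
    const std::size_t numTags = hmm.logStart.size();
    const std::size_t n = words.size();
    const double negInf = -std::numeric_limits<double>::infinity();

    // best[i][t]: log-probability of the best path that ends in tag t at position i.
    std::vector<std::vector<double>> best(n, std::vector<double>(numTags, negInf));
    // back[i][t]: the previous tag on that best path.
    std::vector<std::vector<int>> back(n, std::vector<int>(numTags, 0));

    // Initialization: start probability times emission of the first word.
    for (std::size_t t = 0; t < numTags; ++t) {
        best[0][t] = hmm.logStart[t] + hmm.logEmit[t][words[0]];
    }

    // Forward pass: extend the best path into every tag at every position.
    for (std::size_t i = 1; i < n; ++i) {
        for (std::size_t t = 0; t < numTags; ++t) {
            for (std::size_t p = 0; p < numTags; ++p) {
                const double score =
                    best[i - 1][p] + hmm.logTrans[p][t] + hmm.logEmit[t][words[i]];
                if (score > best[i][t]) {
                    best[i][t] = score;
                    back[i][t] = static_cast<int>(p);
                }
            }
        }
    }

    // Pick the best final tag, then follow the back-pointers to the start.
    int last = 0;
    for (std::size_t t = 1; t < numTags; ++t) {
        if (best[n - 1][t] > best[n - 1][last]) last = static_cast<int>(t);
    }
    std::vector<int> tags(n);
    tags[n - 1] = last;
    for (std::size_t i = n - 1; i > 0; --i) {
        tags[i - 1] = back[i][tags[i]];
    }
    return tags;
}
```

Working in log space turns products of probabilities into sums, which avoids numerical underflow on long sentences. The running time is proportional to the sentence length times the square of the number of tags.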

Domains Explored

Machine Learning, Natural Language Processing, Dynamic Programming

Results

The POS tagging model, decoded with the Viterbi algorithm, achieves an accuracy of 0.9531. Accuracy is measured by comparing the predicted tags with the true labels in /data/test.pos.
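
As a rough sketch of how such an accuracy figure can be computed, the function below counts the positions where the predicted tag matches the true tag. The name `taggingAccuracy` and the representation of tags as strings are assumptions for illustration, not necessarily how components/results.cpp does it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fraction of positions where the predicted tag equals the true (gold) tag.
// Hypothetical helper; both vectors must list tags for the same words, in order.
double taggingAccuracy(const std::vector<std::string>& predicted,
                       const std::vector<std::string>& gold) {
    assert(predicted.size() == gold.size());
    if (gold.empty()) return 0.0;  // Avoid dividing by zero on empty input.

    std::size_t correct = 0;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (predicted[i] == gold[i]) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(gold.size());
}
```

Under this definition, an accuracy of 0.9531 would mean that roughly 95 of every 100 words in the test set received the same tag as the reference label.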

Click here to get a detailed description of all part-of-speech tags.

Output 1

I have one apple and three oranges

/assets/output1

Output 2

Who is the president of USA?

/assets/output2

Output 3

India is my country of residence

/assets/output3

Documentation

For documentation, click here or refer to /documentation/README.md

File Structure

👨‍💻POS-Tagging
 ┣ 📂assets                            // Contains all the reference gifs, images
 ┣ 📂components                        // Header and source files
 ┃ ┣ 📄data.cpp
 ┃ ┣ 📄data.hpp
 ┃ ┣ 📄tokenize.cpp
 ┃ ┣ 📄tokenize.hpp
 ┃ ┣ 📄viterbi.cpp
 ┃ ┣ 📄viterbi.hpp
 ┃ ┣ 📄results.cpp
 ┃ ┣ 📄results.hpp
 ┣ 📂data                              // Dataset
 ┃ ┣ 📄dataset.pos
 ┃ ┣ 📄sample.pos
 ┃ ┣ 📄test.pos
 ┣ 📂documentation                     // Notes & Documentation for project
 ┃ ┣ 📄notes.pdf
 ┃ ┣ 📄README.md
 ┣ 📂Miscellaneous                     // .ipynb implementation
 ┃ ┣ 📄POS-Tagging-C2_W2_Assignment
 ┣ 📄main.cpp
 ┣ 📄README.md

Project Workflow

assets/workflow.png

Getting Started

Prerequisites

To download and use this code, the minimum requirements are:

  • g++ : The GNU C++ compiler, available as part of the GNU Compiler Collection (GCC), or any other C++ compiler
  • Windows 7 or later (64-bit), or any modern Linux distribution (e.g., Ubuntu, Debian, Fedora, Arch Linux)
  • Microsoft VS Code or any other IDE

Installation

Clone the project by running the following command in your terminal or Command Prompt:

git clone https://github.com/PritK99/POS-Tagging.git

Navigate to the POS-Tagging folder:

cd POS-Tagging

Usage

Once the requirements are satisfied, you can build and run the project on your machine. Use the following commands to build and run it:

  • Build the code:
g++ main.cpp components/data.cpp components/tokenize.cpp components/viterbi.cpp components/results.cpp
  • Run the executable
./a.out (For Linux)

or

./a (For Windows)

Acknowledgements and References

License

MIT License