This project is a general-purpose, regular-expression-based tokenizer for tweets. To highlight the power and limitations of a purely regular-expression-based approach, tokenization is performed by pattern matching with a single regular expression; conditional statements and substitutions are deliberately not used.
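The single-regex approach can be sketched as follows. This is a minimal illustration, not the project's actual pattern: the pattern, the `tokenize` helper, and the sample tweet here are hypothetical, but they show how one `findall` call over an alternation of token classes performs the entire tokenization.

```python
# Minimal sketch (NOT the project's actual pattern) of single-regex
# tokenization using the third-party `regex` module; the standard `re`
# module would also accept this particular pattern.
import regex

# Hypothetical pattern: alternatives are tried left to right, so more
# specific token types (URLs, hashtags, mentions) must precede the
# catch-all word and symbol classes.
PATTERN = regex.compile(
    r"""(?:https?://\S+)      # URLs
      | (?:\#\w+)             # hashtags
      | (?:@\w+)              # user mentions
      | (?:\w+(?:'\w+)?)      # words, with an optional apostrophe
      | (?:[^\w\s])           # any remaining single non-space symbol
    """,
    regex.VERBOSE,
)

def tokenize(tweet):
    # A single findall call performs the entire tokenization;
    # no conditional statements or substitutions are involved.
    return PATTERN.findall(tweet)

print(tokenize("Check this out: https://example.com #nlp @user"))
# → ['Check', 'this', 'out', ':', 'https://example.com', '#nlp', '@user']
```

Because alternation is ordered, the design burden falls entirely on arranging and scoping the alternatives, which is precisely where the power and the limitations of this approach show up.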
All the scripts are placed inside a Jupyter notebook, which also includes a detailed write-up covering the following:
- Definition of a token (and the underlying rationale)
- Design decisions in the implementation of the tokenizer
- Walkthrough of the implementation of the tokenizer
- Descriptive statistics of the corpus after tokenization
- Analysis of the power and limitations of the tokenizer
- Comparative analysis with the state-of-the-art NLTK TweetTokenizer
- Performance (running time) of the tokenizer
- Analysis of the most frequent tokens
This is a major course output in an introduction to natural language processing class under Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University.
The project uses the following Python libraries and modules:
Library/Module | Description | License |
---|---|---|
`pandas` | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License |
`csv` | Implements classes to read and write tabular data in CSV format | Python Software Foundation License |
`regex` | Provides additional functionality over the standard `re` module while maintaining backwards-compatibility | Apache License 2.0 |
`nltk` (for comparative analysis of the resulting tokenization) | Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning | Apache License 2.0 |
The descriptions are taken from their respective websites.
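For the comparative analysis, NLTK's `TweetTokenizer` serves as the baseline. A minimal usage sketch (the sample tweet is illustrative; `TweetTokenizer` ships with `nltk` and needs no corpus downloads):

```python
# Baseline tokenizer used in the comparative analysis: NLTK's
# TweetTokenizer, which handles Twitter-specific tokens such as
# mentions, hashtags, and emoticons out of the box.
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("@user Check this out! #nlp :)")
print(tokens)
```

Comparing its output against the single-regex tokenizer on the same tweets makes the trade-offs of the purely pattern-matching approach concrete.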
- Mark Edward M. Gonzales
mark_gonzales@dlsu.edu.ph
gonzales.markedward@gmail.com
The dataset of tweets was scraped by Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University. All the tweets in this dataset are public tweets collected via the Twitter API.