
RegEx-Based Tweet Tokenizer


This project is a general-purpose, regular-expression-based tokenizer for tweets. To highlight both the power and the limitations of a purely regex-driven approach, tokenization is performed by pattern matching with a single regular expression; conditional statements and substitutions are deliberately avoided.
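As an illustration, a minimal sketch of the single-regex approach might look like the following. The pattern below is a simplified stand-in, not the notebook's actual expression, and the token classes it covers (URLs, hashtags, mentions, emoticons, words, punctuation) are assumptions about typical tweet content:

```python
import regex

# Simplified stand-in pattern: one alternation covers every token class,
# so findall() alone performs the entire tokenization.
TOKEN_PATTERN = regex.compile(
    r"""
      https?://\S+                       # URLs
    | \#\w+                              # hashtags
    | @\w+                               # user mentions
    | [:;=8][\-o\*']?[\)\]\(\[dDpP/]     # a few common emoticons
    | \w+(?:['-]\w+)*                    # words, incl. contractions/hyphens
    | [^\w\s]                            # any leftover punctuation mark
    """,
    regex.VERBOSE,
)

def tokenize(tweet):
    # Pattern matching only: no conditionals, no substitutions.
    return TOKEN_PATTERN.findall(tweet)

print(tokenize("@user Check this out: https://t.co/abc #NLP :)"))
# ['@user', 'Check', 'this', 'out', ':', 'https://t.co/abc', '#NLP', ':)']
```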

All the scripts are placed inside a Jupyter notebook, which also includes a detailed write-up covering the following:

  • Definition of a token (and the underlying rationale)
  • Design decisions in the implementation of the tokenizer
  • Walkthrough of the implementation of the tokenizer
  • Descriptive statistics of the corpus after tokenization
  • Analysis of the power and limitations of the tokenizer
  • Comparative analysis with the widely used NLTK TweetTokenizer (see the sketch after this list)
  • Performance (running time) of the tokenizer
  • Analysis of the most frequent tokens
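For reference, the NLTK side of that comparison can be reproduced in a few lines. This is only a hedged illustration of the comparison's setup, not the notebook's actual evaluation code, and the sample tweet is invented:

```python
from nltk.tokenize import TweetTokenizer

tweet = "@user Check this out: https://t.co/abc #NLP :)"

# TweetTokenizer applies its own rules; its output can be diffed against
# the single-regex tokenize() sketch above to surface disagreements.
print(TweetTokenizer().tokenize(tweet))
# ['@user', 'Check', 'this', 'out', ':', 'https://t.co/abc', '#NLP', ':)']
```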

This is a major course output for an introductory natural language processing class taught by Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University.

Built Using

The project is implemented as a Jupyter notebook and uses the following Python libraries and modules:

Library/Module | Description | License
-------------- | ----------- | -------
pandas | Provides functions for data analysis and manipulation | BSD 3-Clause "New" or "Revised" License
csv | Implements classes to read and write tabular data in CSV format | Python Software Foundation License
regex | Provides additional functionality over the standard re module while maintaining backward compatibility | Apache License 2.0
nltk | Provides interfaces to corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning (used here for the comparative analysis of the resulting tokenization) | Apache License 2.0

The descriptions are taken from their respective websites.
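As a sketch of how the corpus might be loaded with these libraries before tokenization: the file name tweets.csv and the text column are hypothetical placeholders, since the dataset's actual schema is not described here.

```python
import pandas as pd

# Hypothetical file and column names; the real dataset's schema may differ.
df = pd.read_csv("tweets.csv")
tweets = df["text"].astype(str).tolist()
print(len(tweets), "tweets loaded")
```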

Dataset

The dataset consists of public tweets collected via the Twitter API. It was scraped by Mr. Edward P. Tighe of the Department of Software Technology, De La Salle University.