A Python Library for Preprocessing

This repo contains a Python library to perform preprocessing for a sentiment analysis task with a CNN + embedding model.

Required Input: a string of raw text

Optional Inputs: maximum dictionary size, maximum tweet length

Output: a list of indices

Four Main Methods:

clean_text: Remove URLs and unnecessary tokens from a tweet
tokenize_text: Convert a string into an array of tokens using TweetTokenizer from nltk
replace_token_with_index: Replace each token with its index in the Twitter GloVe embedding dictionary
pad_sequence: Pad a list of indices with 0 up to a maximum length
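
The four steps chain together roughly as in the sketch below. This is a minimal illustration, not the code in preprocessor.py: the function signatures, the URL pattern, the lowercasing step, the use of index 0 for unknown tokens, the truncation of long inputs, and the toy vocabulary are all assumptions made for the example. If nltk is not installed, the sketch falls back to plain whitespace splitting in place of TweetTokenizer.

```python
import re

try:
    # The README names nltk's TweetTokenizer; it is regex-based and needs no corpus download.
    from nltk.tokenize import TweetTokenizer
    _tokenize = TweetTokenizer().tokenize
except ImportError:
    # Stand-in when nltk is unavailable: plain whitespace splitting.
    _tokenize = str.split

# Assumed URL pattern; the library's actual cleaning rules may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text):
    """Remove URLs, collapse whitespace, and lowercase (lowercasing is an assumption)."""
    text = URL_PATTERN.sub(" ", text)
    return " ".join(text.split()).lower()


def tokenize_text(text):
    """Split a cleaned string into a list of tokens."""
    return _tokenize(text)


def replace_token_with_index(tokens, vocab, unknown_index=0):
    """Map each token to its vocabulary index; unknown tokens get unknown_index (assumed 0)."""
    return [vocab.get(token, unknown_index) for token in tokens]


def pad_sequence(indices, max_length):
    """Truncate to max_length (an assumption), then right-pad with 0."""
    indices = indices[:max_length]
    return indices + [0] * (max_length - len(indices))


# Hypothetical toy vocabulary standing in for the GloVe index file.
toy_vocab = {"great": 1, "movie": 2, "tonight": 3}

tokens = tokenize_text(clean_text("Great movie tonight http://t.co/abc"))
indices = pad_sequence(replace_token_with_index(tokens, toy_vocab), max_length=5)
print(indices)  # [1, 2, 3, 0, 0]
```

Note that in this sketch the padding value and the unknown-token index are both 0, so the two cases are indistinguishable downstream; a real vocabulary would normally reserve separate indices if the model needs to tell them apart.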

Files

preprocessor.py: code for library
preprocessor_test.py: code for unit testing
glove.twitter.27B.25d.index: Twitter GloVe embedding dictionary from https://github.com/stanfordnlp/GloVe
.travis.yml: for Travis CI

yfsui/twitter