This repo contains a Python library that preprocesses raw text for a sentiment analysis task using a CNN + embedding model.
Required Input: a string of raw text
Optional Inputs: maximum dictionary size, maximum tweet length
Output: a list of indices
Four Main Methods:
- clean_text: Remove URLs and unnecessary tokens from a tweet
- tokenize_text: Convert a string into an array of tokens using TweetTokenizer from nltk
- replace_token_with_index: Replace each token with its index in the twitter GloVe embedding dictionary
- pad_sequence: Pad a list of indices with 0 up to a maximum length
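
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocessor.py. The function signatures, the URL pattern, the unknown-token index, the right-side padding, and the truncation of long inputs are all assumptions. A plain lowercase whitespace split stands in for nltk's TweetTokenizer so the sketch has no third-party dependencies. `TOY_INDEX` is a hypothetical five-entry vocabulary used in place of glove.twitter.27B.25d.index.

```python
import re

# Hypothetical toy vocabulary standing in for glove.twitter.27B.25d.index.
TOY_INDEX = {"<unk>": 1, "i": 2, "love": 3, "this": 4, "movie": 5}

# Assumed URL pattern; the library's actual cleaning rules may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text):
    """Remove URLs and collapse runs of whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


def tokenize_text(text):
    """Lowercase whitespace split (a stand-in for nltk's TweetTokenizer)."""
    return text.lower().split()


def replace_token_with_index(tokens, index, unknown_index=1):
    """Map each token to its vocabulary index; unseen tokens get unknown_index."""
    return [index.get(token, unknown_index) for token in tokens]


def pad_sequence(indices, max_length):
    """Truncate to max_length (an assumption), then right-pad with 0."""
    trimmed = indices[:max_length]
    return trimmed + [0] * (max_length - len(trimmed))


def preprocess(text, index=TOY_INDEX, max_length=8):
    """Chain the four steps: clean, tokenize, index, pad."""
    tokens = tokenize_text(clean_text(text))
    return pad_sequence(replace_token_with_index(tokens, index), max_length)


if __name__ == "__main__":
    print(preprocess("I love this movie http://t.co/abc", max_length=6))
    # [2, 3, 4, 5, 0, 0]
```

Reserving 0 for padding keeps padded positions distinguishable from real tokens, which is why the toy vocabulary starts at 1.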
Files
- preprocessor.py: code for library
- preprocessor_test.py: code for unit testing
- glove.twitter.27B.25d.index: twitter GloVe embedding dictionary from https://github.com/stanfordnlp/GloVe
- .travis.yml: for Travis CI