In text mining projects, an important step is preprocessing, which can be divided into the following stages (a minimal sketch of these steps follows the list):

- Noise removal. Raw texts are usually messy to some extent, particularly texts from social media, which contain many URLs, hashtags, typos, abbreviations, emoji, punctuation and deliberate misspellings. These symbols may seem to carry little useful information, but punctuation can in fact affect the sentiment and meaning of a sentence.
- Normalization. Stem or lemmatize the words, for example replacing "am, is, are" with the root "be". This reduces the size of the vocabulary. Further normalization may include expanding abbreviations and mapping upper-case words to lower case.
- Tokenization. A text is usually stored as one long string, which a computer cannot handle directly. An article consists of paragraphs, a paragraph consists of sentences, and a sentence consists of words; words are the basic elements of a sentence. We need to tokenize texts into sequences of words to analyse their semantics and syntax. Alternatively, a sentence can be tokenized into characters or n-grams.
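As an illustration, here is a minimal preprocessing sketch in Python. The regular expressions, the `remove_noise`/`normalize` helpers and the example sentence are assumptions for demonstration, not the project's actual code; NLTK's punkt tokenizer data is assumed to be installed.

```python
import re
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

def remove_noise(text):
    """Noise removal: strip URLs, mentions/hashtags and repeated whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    return re.sub(r"\s+", " ", text).strip()

def normalize(tokens):
    """Normalization: lower-case tokens (lemmatization could be added here)."""
    return [t.lower() for t in tokens]

raw = "Loved this movie!!! https://example.com #mustwatch"
tokens = normalize(word_tokenize(remove_noise(raw)))
print(tokens)  # ['loved', 'this', 'movie', '!', '!', '!']
```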
For more background, see the course https://www.coursera.org/learn/language-processing, which covers the fundamental knowledge and skills of modern NLP. Three of the most popular NLP tools are spaCy, NLTK and TextBlob, all of which provide functions such as tokenization and lemmatization, and even sentiment analysis training and testing; a quick comparison is sketched below.
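A quick illustration of what these libraries offer, assuming NLTK's punkt data and spaCy's en_core_web_sm model are installed; the example sentence and the commented outputs are illustrative only.

```python
import spacy
from nltk.tokenize import word_tokenize   # needs nltk.download('punkt')
from textblob import TextBlob

sentence = "The cats are running happily."

# NLTK: word-level tokenization
print(word_tokenize(sentence))            # ['The', 'cats', 'are', 'running', 'happily', '.']

# spaCy: tokenization plus lemmatization
nlp = spacy.load("en_core_web_sm")
print([token.lemma_ for token in nlp(sentence)])  # roughly ['the', 'cat', 'be', 'run', 'happily', '.']

# TextBlob: out-of-the-box polarity score in [-1, 1]
print(TextBlob(sentence).sentiment.polarity)
```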
Preprocessing choices can affect the performance of classification models, because different methods produce different vocabularies and therefore different representations of the texts.
Each dataset has its own characteristics: some are quite clean, others are very messy. Consequently, there is no universally ideal preprocessing method that works well for every dataset; most current practice is based on empirical results.
In this project, we aimed to check whether punctuation matters in sentiment analysis. We designed three groups of experiments (a sketch of the three variants follows the list):

- applying the vectorizer to the raw texts, without removing punctuation;
- applying the vectorizer to texts from which punctuation had been removed;
- applying the vectorizer to texts that had been lemmatized at the word level, with the lemmatization implemented via spaCy.
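A sketch of how the three text variants could be produced; the helper names, the example review and the choice of disabled spaCy components are assumptions, not the project's actual implementation.

```python
import string
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # keep only what lemmatization needs

def strip_punctuation(text):
    """Variant 2: remove all punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def lemmatize(text):
    """Variant 3: word-level lemmatization with spaCy."""
    return " ".join(token.lemma_ for token in nlp(text))

review = "This movie wasn't great, but the acting was surprisingly good!"
variants = {
    "raw": review,                          # variant 1: untouched text
    "no_punct": strip_punctuation(review),  # variant 2
    "lemmatized": lemmatize(review),        # variant 3
}
```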
We tried both unigram and n-gram vectorizers. To avoid a huge vocabulary, we set limits on the frequencies of words or n-grams: tokens that were too rare or too frequent were filtered out, and only medium-frequency ones were kept.
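The write-up does not state whether a count-based or TF-IDF vectorizer was used, nor the exact frequency thresholds, so the sketch below uses scikit-learn's TfidfVectorizer with illustrative `min_df`/`max_df` values; `train_texts` and `test_texts` stand for the lists of reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram vectorizer; min_df / max_df drop very rare and very frequent tokens
# to keep the vocabulary at a manageable size (thresholds are illustrative).
unigram_vec = TfidfVectorizer(ngram_range=(1, 1), min_df=5, max_df=0.5)

# 1- to 3-gram vectorizer with the same frequency filtering.
ngram_vec = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_df=0.5)

X_train = ngram_vec.fit_transform(train_texts)  # train_texts: list of review strings
X_test = ngram_vec.transform(test_texts)
```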
We used the IMDB dataset downloaded from http://ai.stanford.edu/~amaas/data/sentiment/, and compared the performance of Logistic Regression and Naive Bayes.
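One possible way to train and score the two classifiers with scikit-learn, reusing the features from the previous sketch; the Naive Bayes variant (MultinomialNB) and the `max_iter` setting are assumptions, and `y_train`/`y_test` stand for the sentiment labels.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Logistic Regression on the vectorized reviews
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print("LR accuracy:", accuracy_score(y_test, lr.predict(X_test)))

# Naive Bayes on the same features
nb = MultinomialNB()
nb.fit(X_train, y_train)
print("NB accuracy:", accuracy_score(y_test, nb.predict(X_test)))
```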
The classification accuracies on the test set are shown below:
| | LR + Unigram | LR + n-gram (1-3) | NB |
|---|---|---|---|
| Raw text | 0.8868 | 0.902 | 0.8318 |
| Text without punctuation | 0.8865 | 0.9011 | 0.833 |
| Lemmatized text | 0.8815 | 0.9016 | 0.8248 |
For the IMDB data, the preprocessing methods described above have no apparent influence on classification accuracy, although the table suggests that the raw texts work slightly better than the processed ones. To investigate this further, we need to run experiments on more datasets.