This is an implementation of sentiment analysis of Amazon reviews with NLP on Python 2.7 using Scikit-Learn. To analyze Amazon reviews, the text data is first converted into number vectors. This implementation uses three vectorization techniques: "Bag of Words", "Word to Vector", and a hash function. Next, a model is trained on the vector data and its labels (Positive or Negative) using a training algorithm. Three algorithms are tested: "Decision Tree", "Random Forest", and "Multi Layer Perceptron". Lastly, the trained model predicts whether each review in the test data is positive or negative.
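The vectorize, train, and predict steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the toy reviews, the variable names, and the choice of `CountVectorizer` with `DecisionTreeClassifier` are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data (hypothetical); 1 = Positive, 0 = Negative.
train_texts = [
    "great product, works well",
    "terrible quality, broke quickly",
    "love it, great value",
    "awful, waste of money",
]
train_labels = [1, 0, 1, 0]

# Step 1: convert the text into number vectors (word counts).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: train a classifier on the vectors and their labels.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, train_labels)

# Step 3: vectorize unseen text with the same vocabulary and predict.
test_texts = ["great value"]
X_test = vectorizer.transform(test_texts)
predictions = classifier.predict(X_test)
print(predictions)
```

Swapping in `RandomForestClassifier` or `MLPClassifier` from Scikit-Learn should only require changing the classifier line, since they share the same `fit` and `predict` interface.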
Machine learning algorithms take numbers or number vectors as input. Since Amazon review data is text, it must be converted into number vectors. This process is called vectorization. The "Bag of Words" algorithm generates a list of words and a set of number vectors. The list contains every word that appears in the input data, and each word is assigned a number, such as its index in the list. Every sentence can then be converted into a number vector whose entries map to the corresponding words in the list. This algorithm does not keep the order of words. By contrast, the hash-function vectorization used here in place of "Bag of Words" keeps the order of words.
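To make the "Bag of Words" idea concrete, here is a small pure-Python sketch. It is an illustration under simplifying assumptions (lowercasing and whitespace splitting only, word counts as vector entries), and the helper names `build_vocabulary` and `vectorize` are hypothetical, not taken from the scripts in this repository.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign each distinct word an index in order of first appearance."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(text, vocabulary):
    """Return a vector with one slot per vocabulary word, holding its count."""
    vector = [0] * len(vocabulary)
    counts = Counter(text.lower().split())
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


reviews = ["good product good price", "bad product"]
vocabulary = build_vocabulary(reviews)
vectors = [vectorize(review, vocabulary) for review in reviews]

print(vectors)  # [[2, 1, 1, 0], [0, 1, 0, 1]]

# Word order is lost: a shuffled sentence produces the same vector.
print(vectorize("price good product good", vocabulary) == vectors[0])  # True
```

The last line shows why this representation does not keep word order: only the count per word is stored.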
- Source code
- Test datasets
400K Amazon review texts with labels (Positive or Negative)
- Python 2.7
- Scikit-learn
- Pandas
- NLTK
- NumPy
- BeautifulSoup
- stopwords
- SnowballStemmer
$ python BagsOfWords.py
$ python WordToVector.py
$ python OnlineLearning.py
This is the second assignment of CSCI-561 Foundations of Artificial Intelligence, Summer 2018.
Version 1.0