/News-Popularity-Prediction

1. web scraping; 2. data preprocessing; 3. text vectorization; 4. Clustering; 5. Prediction

Primary LanguagePython

News-Popularity-Prediction

Step1 web scraping

  • web scraping.py

Use Scrapy to scrape data(news title, date, contents, # of reviews, # of participants and first 10 reviews) from www.sina.com.cn. Note: The code can be easily adjusted for other websites.

Step2 data preprocessing

  • word segmentation.py

Because there is no space between Chinese words, segmentation is needed before any text analysis. The code uses 'jieba' to segment words.

  • delete stop words.R

Delete the stop words based on a Chinese stop words reference.

Step3 text vectorization Use the tool word2vec provided by Google (available on http://word2vec.googlecode.com/svn/trunk/) to train word vectors and form clusters.

Step4 clustering

  • clustering.c

Transform titles and reviews into high-dimensional numeric vector based on clustering.

Step5 prediction

  • predcition.py

Use random forest to train the classifier to predict the popularity of a certain news title.