Download, extract and index Weiboscope data

Weiboscope Data

Download, extract and index the Weiboscope dataset collected from Sina Weibo by JMSC HKU (available here).

Requirements:

* 30+ GB free space
* Elasticsearch
* Elasticsearch SmartCN Analyzer plugin
* MongoDB
* Python 2.7

Tested on Debian

How to use it

Download the complete dataset (18 GB), build the User API data, and index all content into Elasticsearch.
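As a rough illustration of the indexing step, the sketch below turns a few message records into the newline-delimited JSON that the Elasticsearch bulk API accepts. It is not taken from this repository's scripts: the function name `to_bulk_lines`, the index name `weiboscope`, the type name `weibo`, and the field names `mid`, `uid` and `text` are all assumptions made for the example.

```python
import json


def to_bulk_lines(rows, index_name, doc_type=None, id_field="mid"):
    """Build a bulk-API payload: one action line, then one source line, per row.

    All names here are hypothetical; adapt them to the real CSV columns.
    """
    lines = []
    for row in rows:
        meta = {"_index": index_name, "_id": row[id_field]}
        # Older Elasticsearch releases also expect a document type.
        if doc_type is not None:
            meta["_type"] = doc_type
        lines.append(json.dumps({"index": meta}, sort_keys=True))
        lines.append(json.dumps(row, sort_keys=True))
    if not lines:
        return ""
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


# Two made-up records standing in for rows read from the dataset.
sample_rows = [
    {"mid": "m1", "uid": "u1", "text": "hello"},
    {"mid": "m2", "uid": "u2", "text": "world"},
]
payload = to_bulk_lines(sample_rows, "weiboscope", doc_type="weibo")
print(payload)
```

The resulting string could then be sent to the `_bulk` endpoint in batches; batching keeps memory use bounded when working through a corpus of this size.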

Install

You will need MongoDB and Elasticsearch with the Smart Chinese Analyzer plugin:

bin/plugin -install elasticsearch/elasticsearch-analysis-smartcn/2.1.0
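Once the plugin is installed, the message text needs to be mapped to its analyzer. The sketch below builds one possible index body, assuming an older Elasticsearch release in which fields use the `string` type and mappings are keyed by a type name. The type name `weibo` and the field name `text` are hypothetical; `smartcn` is the analyzer name registered by the plugin.

```python
import json


def build_index_body(doc_type="weibo", text_field="text"):
    """Return an index-creation body that analyzes one field with smartcn.

    doc_type and text_field are placeholder names, not this project's own.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # Chinese text is segmented by the Smart Chinese analyzer.
                    text_field: {"type": "string", "analyzer": "smartcn"}
                }
            }
        }
    }


body = build_index_body()
print(json.dumps(body, indent=2, sort_keys=True))
```

The printed JSON could be used as the request body when creating the index, before any documents are added.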

Data: the Weiboscope corpus

The Weiboscope dataset contains sample data from the 52 weeks of 2012, collected from more than 350,000 Chinese microbloggers who have more than 1,000 followers (Fu, Chan & Chau, 2013).

Note: this data has been anonymized.

Data Set Statistics:

* Number of weibo messages: 226841122
* Number of deleted messages: 10865955
* Number of censored ('Permission Denied') messages: 86083
* Number of unique weibo users: 14387628
* 57 files, 18G
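
To put these counts in proportion, the short sketch below derives a few ratios from the figures listed above. The variable and function names are made up for the example; only the four counts come from the statistics.

```python
# Counts copied from the data set statistics above.
stats = {
    "messages": 226841122,
    "deleted": 10865955,
    "censored": 86083,
    "users": 14387628,
}


def share(part, whole):
    """Return part / whole as a float (safe under Python 2 integer division)."""
    return float(part) / whole


deleted_share = share(stats["deleted"], stats["messages"])
censored_share = share(stats["censored"], stats["messages"])
censored_among_deleted = share(stats["censored"], stats["deleted"])
messages_per_user = share(stats["messages"], stats["users"])

print("deleted: {:.2%} of all messages".format(deleted_share))
print("censored: {:.3%} of all messages".format(censored_share))
print("censored: {:.2%} of deleted messages".format(censored_among_deleted))
print("messages per unique user: {:.1f}".format(messages_per_user))
```

Deleted messages come to just under 5% of the corpus, and messages flagged 'Permission Denied' to well under 0.1%, so any analysis of censorship works with a small minority of the data.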