nghuyong/weibo-cov

Data collection time

wyiq opened this issue · 10 comments

wyiq commented

Hi HuYong,

This is a GREAT dataset, and thank you all for the work!

I am wondering when the team collected the dataset, as we are concerned that censorship may affect the validity of the data.

Thanks,

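Thank you very much for providing such a rich and detailed Weibo database! When I import it into sql-2014 (Express edition), I get an error saying the row delimiter cannot be found, and the Chinese characters come out garbled. I get the same error when I try Access. How can I resolve this? Thanks!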

@wyiq For more details, please refer to our paper: https://arxiv.org/abs/2005.09174

@knightc2020 The Chinese characters are encoded in UTF-8. Please make sure the database encoding is set correctly as well.
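Some Windows import tools may guess a legacy encoding when a UTF-8 file has no byte-order mark, and may expect CRLF row delimiters. A minimal sketch of one possible workaround, assuming the dataset file is plain UTF-8 text; the function name and file names are hypothetical and not part of the dataset:

```python
import codecs
from pathlib import Path


def add_bom_and_crlf(src: str, dst: str) -> None:
    """Rewrite a UTF-8 text file as UTF-8 with a BOM and CRLF line endings."""
    # Decode strictly as UTF-8 so a wrong source encoding fails loudly;
    # "utf-8-sig" also drops a BOM if the file already has one.
    text = Path(src).read_bytes().decode("utf-8-sig")
    # Normalize every line ending to "\n" first, then convert to "\r\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    Path(dst).write_bytes(codecs.BOM_UTF8 + text.encode("utf-8"))


# Hypothetical usage:
# add_bom_and_crlf("weibo.csv", "weibo_windows.csv")
```

Note that this also rewrites line breaks inside multi-line fields, so it is worth checking a few rows after the import.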

wyiq commented

@nghuyong Thank you, Hu Yong, but my question is a different one. Did the team collect the data continuously between December 1, 2019 and April 30, 2020, or all at once after April 30, 2020? This is critical: if the data were collected after April 30, some tweets posted on, e.g., Feb. 13 might already have been censored in March and would be missing. If instead collection ran in step with posting, we may see those tweets in your dataset. The paper does not address this question clearly.

Thanks,

@wyiq
The field crawl_time in each record is the timestamp at which we collected that record.
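To see when a subset of the data was actually collected, the crawl_time values can be summarized. A minimal sketch, assuming each record is a dict and that crawl_time holds Unix epoch seconds; the storage format is not stated in this thread, so adjust the parsing if it is a date string. The function name and sample records are hypothetical:

```python
from datetime import datetime, timezone


def crawl_time_range(records):
    """Return the (earliest, latest) crawl_time of the records as UTC datetimes."""
    # ASSUMPTION: crawl_time is Unix epoch seconds (int or numeric string).
    stamps = [int(r["crawl_time"]) for r in records]
    earliest = datetime.fromtimestamp(min(stamps), tz=timezone.utc)
    latest = datetime.fromtimestamp(max(stamps), tz=timezone.utc)
    return earliest, latest


# Made-up records for illustration only, not taken from the dataset.
sample = [
    {"crawl_time": int(datetime(2020, 4, 25, tzinfo=timezone.utc).timestamp())},
    {"crawl_time": int(datetime(2020, 4, 28, tzinfo=timezone.utc).timestamp())},
]
print(crawl_time_range(sample))
```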

@wyiq
We collected the data posted between December 1, 2019 and April 20, 2020 during April 25 to 28, 2020,
and the data posted between April 21, 2020 and April 30, 2020 during May 1 to 4, 2020.
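For the censorship question, these two windows give a lower and upper bound on how long a post had been online before it was collected. A small sketch using only the dates stated in this reply; the names WINDOWS and crawl_lag_days are illustrative:

```python
from datetime import date

# (first post date, last post date, crawl start, crawl end), as stated in this reply
WINDOWS = [
    (date(2019, 12, 1), date(2020, 4, 20), date(2020, 4, 25), date(2020, 4, 28)),
    (date(2020, 4, 21), date(2020, 4, 30), date(2020, 5, 1), date(2020, 5, 4)),
]


def crawl_lag_days(posted: date):
    """Return (min_days, max_days) between a post's date and its collection."""
    for first, last, crawl_start, crawl_end in WINDOWS:
        if first <= posted <= last:
            return (crawl_start - posted).days, (crawl_end - posted).days
    raise ValueError(f"{posted} is outside the dataset's date range")


print(crawl_lag_days(date(2020, 2, 13)))  # (72, 75)
```

So a tweet posted on Feb. 13 was collected 72 to 75 days later, and anything removed in the meantime would not appear in the dataset.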

Thank you very much for sharing the data! May I ask why the "Weibo ID" crawled in the dataset is not the user's Weibo username?

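@duoremi7 Right, the Weibo ID is the unique identifier of the post and has nothing to do with the user's ID, so the dataset contains no user information. It is anonymized.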

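Thank you very much for the reply. I am new to working with data and have been wanting to apply Weibo data in my own field for social network analysis, which involves analyzing the features of user nodes. Can the ID in this dataset be used to find the user who posted a given post?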

Sorry, it cannot. The dataset does not include any Weibo user information. It is anonymized.