/chat-censorship

Data related to investigation of chat client censorship

Primary LanguageLua

Overview

This repository contains keyword blacklists and lists of other content such as URLs or images used to trigger censorship in apps used in China. With the exception of WeChat, these lists were reverse engineered and are the exhaustive lists of keywords used to trigger censorship on these platforms.

The full details on data collection and analysis methods and results are available below.

Chat apps

The research below tracks daily changes to censorship in three different chat apps used in China: TOM-Skype, Sina UC, and Line. Overall, our chat app data consists of over 4,000 blacklisted keywords.

Data: TOM-Skype and Sina UC, LINE

Live-streaming apps

The research below tracks hourly changes to censorship in three different live streaming apps in China: YY, Sina Show, and 9158; and documents the keywords censored by GuaGua, which does not include a mechanism for downloading updates to its censorship blacklists. Overall, our live-streaming data consists of over 20,000 blacklisted keywords.

Data: Original live-streaming data (2015), Updated live-streaming data (2017)

Mobile games

Our research on mobile games analyzes domestic Chinese games as well as international games that have been altered to comply with Chinese regulations. Overall, we found hundreds of mobile games performing censorship, collectively censoring over 100,000 unique blacklisted keywords.

Data: Mobile games

Open source projects

This research analyzes Chinese censorship in open source projects. We extracted over 1,000 Chinese keyword blacklists from open source projects on GitHub, collectively spanning over 200,000 unique blacklisted keywords.

Data: Open source blacklists

WeChat

Our research on WeChat censorship uses sample testing to determine what type of content, such as words, URLs, and images, can be communicated over the platform and which content is censored. We have studied what categorical content WeChat generally filters in addition to what content WeChat filters in response to specific events.

Data: Keywords and URLs (November 2016), 709 Crackdown keywords and images (April 2017), Liu Xiaobo keywords and images (July 2017), 19th Party Congress keywords (November 2017), Image filtering test data (May 2018)

Keyword Content Analysis

Datasets include raw keyword lists collected from the applications. Many also include processed data including translations and categorization of keywords. Keywords were translated to English using a combination of machine and human translation. Based on interpreting these translations with contextual information, we coded each keyword into content categories grouped under six general themes according to a code book.

License

All data is provided under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International and available in full here and summarized here.