chinese-keywords

Sensitive Chinese keywords collected from various sources, for censorship testing and for searching for sensitive content

This repository contains a set of sensitive Chinese keywords (that is, keywords related to the Chinese Communist Party, pornography, dissident material, violence/terrorism, censorship, etc.). These keywords may be helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.
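As a rough illustration of the first use case, the sketch below scans a piece of text for any keyword from a list. Chinese text is not whitespace-delimited, so it uses plain substring matching rather than word tokenization. The function name `find_keywords` and the sample keywords are hypothetical placeholders and are not taken from this repository's files.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in text, ordered by first appearance."""
    hits = []
    for kw in keywords:
        pos = text.find(kw)
        # Skip empty strings, which str.find would match at position 0.
        if kw and pos != -1:
            hits.append((pos, kw))
    return [kw for pos, kw in sorted(hits)]


# Placeholder keywords for illustration only (not from the repository's lists).
sample_keywords = ["敏感词", "关键词", "审查"]
sample_text = "这篇文章讨论了关键词过滤和网络审查。"

matches = find_keywords(sample_text, sample_keywords)
print(matches)
```

Substring matching is simple but can over-match when a short keyword happens to occur inside a longer, unrelated phrase, so results are best treated as candidates for manual review.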

As of Dec 9, there are 9,054 sensitive keywords collected from 13 different lists (see below for detailed info on the lists). To get a sense of what data is included in these CSV files, you can view a Google Doc spreadsheet of these 9,054 keywords sorted by the number of lists they appear on: https://docs.google.com/spreadsheets/d/19eS47Dg086vR1jh9oo51pXstYVT2wft13JGCrnAeU7A/edit?usp=sharing
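The "sorted by the number of lists they appear on" ordering could be reproduced from a long-format table with one row per (keyword, source list) pair. The sketch below assumes hypothetical column names `keyword` and `source`; the actual CSV headers in this repository may differ, and the inline sample rows are placeholders rather than real data.

```python
import csv
import io
from collections import defaultdict

# Placeholder data; the column names "keyword" and "source" are assumptions.
SAMPLE_CSV = """keyword,source
关键词,list_a
关键词,list_b
关键词,list_b
审查,list_a
敏感词,list_c
审查,list_c
关键词,list_c
"""


def rank_by_list_count(csv_file):
    """Return (keyword, number of distinct lists) pairs, most widely listed first."""
    sources = defaultdict(set)
    for row in csv.DictReader(csv_file):
        # A set ignores a keyword repeated within the same list.
        sources[row["keyword"]].add(row["source"])
    counts = ((kw, len(lists)) for kw, lists in sources.items())
    # Sort by descending list count, then by keyword for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


ranking = rank_by_list_count(io.StringIO(SAMPLE_CSV))
for keyword, n_lists in ranking:
    print(keyword, n_lists)
```

To try this on a real file, pass a handle opened with `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object, where `path` is whichever CSV you are working with.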

The CSV files contain machine translations (from Google) and human translations/notes for most of the keywords. Many keywords also include theme and category variables, thanks to sources that had previously tagged their keyword lists. Currently, there are three different versions.

The thirteen lists this collection contains are:

| Creator/source | Tested on/found from | # of keywords | Year | Method + source |
| --- | --- | --- | --- | --- |
| The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link |
| The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link |
| The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; more analysis here; download link |
| Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
| Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they'd trigger GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
| China Digital Times | Sina Weibo | 2,448 | 2014 | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT's Grass Mud Horse Lexicon e-book; download link |
| GreatFire.org | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
| Google/ATGFW.org | Google/Great Firewall | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of censorship while using their service in China; download link |
| Jeffrey Knockel | Sina Show | 910 | 2014 | extracted list from Sina Show app; download link |
| Unknown | 163.com | 376 | 2008 | archived by Nart Villeneuve; circulated on 163.com, a Chinese portal website; download link |
| Omnitalk BBS users? | Tencent QQ | 863 | 2004 | archived by Nart Villeneuve; extracted from Tencent QQ app; more info and analysis from CDT; download link |
| Jed Crandall et al / "ConceptDoppler" | Great Firewall | 669 | 2008 | archived by Nart Villeneuve; "HTTP keyword filtering by Internet routers"; website; paper; download link |
| Unknown | a "blog provider" | 844 | 2005 | archived by Nart Villeneuve; according to Villeneuve: "This is a keyword list from a blog provider in China."; download link |

This project was started at The Citizen Lab's 2014 Connaught Summer Institute workshop.