This collection contains a set of sensitive Chinese keywords (that is, keywords related to the Chinese Communist Party, pornography, dissident material, violence/terrorism, censorship, etc.). These keywords may be helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.
As of Dec 9, there are 9,054 sensitive keywords collected from 13 different lists (see below for detailed info on the lists). To get a sense of what data is included in these CSV files, you can view a Google Doc spreadsheet of these 9,054 keywords sorted by the number of lists they appear on: https://docs.google.com/spreadsheets/d/19eS47Dg086vR1jh9oo51pXstYVT2wft13JGCrnAeU7A/edit?usp=sharing
The CSV files contain machine translations (from Google) and human translations/notes for most of the keywords. Many keywords also have theme and category variables, thanks to various sources that had previously tagged their keyword lists. Currently, there are three different versions:
- all.csv: all the keywords and all available data/variables, plus 3,987 popular (3,803 non-sensitive) keywords that can be used as possible controls for searching. These popular/non-sensitive keywords were taken from the titles of the top 1,000 most viewed articles on Wikipedia China in April 2013 (995 after a few Wikipedia meta-pages were removed) and from the titles of articles that generated more than 10 combined views on August 1 during 0:00-1:00 and 12:00-13:00.
- no-dummy-vars-for-categories-and-themes.csv: all the keywords, without the dummy variables for each of the themes and categories that were tagged by The Citizen Lab. Category/theme info is instead stored in catch-all "category" and "theme" variables (columns).
- no-dummy-vars-for-categories-and-themes_only-sensitive-words.csv: same as above, but with the non-sensitive words also removed. Once downloaded, the file can be sorted by keyword length or by how many of the lists each keyword appears on.
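
As a rough illustration of the sorting described above, here is a minimal Python sketch that orders rows by how many lists a keyword appears on and then by keyword length. The column names `keyword` and `list_count` are assumptions made for this example, not the actual headers of the CSV files, so check the header row of the file you downloaded and adjust them. The sample rows are made-up placeholders.

```python
import csv
import io


def load_rows(handle):
    """Read CSV rows as dictionaries keyed by the header row."""
    return list(csv.DictReader(handle))


def sort_keywords(rows, keyword_col="keyword", count_col="list_count"):
    """Sort by number of lists (descending), then keyword length (ascending).

    The default column names are hypothetical; pass the real ones.
    An empty count cell is treated as 0.
    """
    return sorted(
        rows,
        key=lambda r: (-int(r[count_col] or 0), len(r[keyword_col])),
    )


# Tiny in-memory stand-in for a downloaded file (placeholder words and counts).
SAMPLE = "keyword,list_count\n图书馆,2\n天气,4\n火车,2\n"

for row in sort_keywords(load_rows(io.StringIO(SAMPLE))):
    print(row["keyword"], row["list_count"], len(row["keyword"]))
```

To run this against a real file, replace the `io.StringIO(SAMPLE)` handle with something like `open("all.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is also an assumption.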
The thirteen lists this collection contains are:
Creator/source | Tested on/found from | # of keywords | Year | Method + source |
---|---|---|---|---|
The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link |
The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link |
The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; more analysis here; download link |
Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they'd trigger GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
China Digital Times | Sina Weibo | 2,448 | 2014 | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT's Grass Mud Horse Lexicon e-book; download link |
GreatFire.org | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
Google/ATGFW.org | Google/Great Firewall | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of censorship while using their service in China; download link |
Jeffrey Knockel | Sina Show | 910 | 2014 | extracted list from Sina Show app; download link |
Unknown | 163.com | 376 | 2008 | archived by Nart Villeneuve; circulated on 163.com, a Chinese portal website; download link |
Omnitalk BBS users? | Tencent QQ | 863 | 2004 | archived by Nart Villeneuve; extracted from Tencent QQ app; more info and analysis from CDT; download link |
Jed Crandall et al / "ConceptDoppler" | Great Firewall | 669 | 2008 | archived by Nart Villeneuve; "HTTP keyword filtering by Internet routers"; website; paper; download link |
Unknown | a "blog provider" | 844 | 2005 | archived by Nart Villeneuve; according to Villeneuve: "This is a keyword list from a blog provider in China."; download link |
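
The collection records how many of the source lists each keyword appears on. The sketch below shows one way such a count could be computed; it is not necessarily how these files were built. It assumes each source list has already been loaded as a sequence of strings, and the function name, list names, and sample entries are all hypothetical.

```python
from collections import Counter


def count_list_membership(lists):
    """Map each keyword to the number of distinct lists it appears on.

    `lists` is a dict of {list_name: iterable of keyword strings}.
    """
    counts = Counter()
    for keywords in lists.values():
        # Strip whitespace and use a set so that a keyword repeated
        # within a single list is only counted once for that list.
        counts.update({kw.strip() for kw in keywords if kw.strip()})
    return counts


# Placeholder data standing in for three source lists.
example_lists = {
    "list_a": ["alpha", "beta", "beta"],
    "list_b": ["beta", "gamma"],
    "list_c": ["beta", "alpha "],
}

counts = count_list_membership(example_lists)
for keyword, n in counts.most_common():
    print(keyword, n)
```

Counting per list rather than per occurrence keeps a long list with internal duplicates from inflating the totals.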
This project was started at The Citizen Lab's 2014 Connaught Summer Institute workshop.