This repository contains parsed and cleaned-up vocabulary wordlists for TOCFL and related tests in .csv format, covering the current version as well as some historical ones.
TOCFL (Test of Chinese as a Foreign Language): standardized Mandarin proficiency test for non-native speakers, the Taiwanese counterpart of mainland China's HSK.
CCCC (Children's Chinese Competency Certification): related test aimed at non-native child speakers aged 7-12.
TBCL (Taiwan Benchmarks for the Chinese Language): reference wordlist and grammar points list for constructing language tests, but not a test per se.
tocfl-202307.csv: current 2022/2023 TOCFL wordlist (html, with definitions)
- 7517 terms (7849 with variants, 7480 unique)
- 7 levels (in ID column together with term index):
  - L0-1nnn = Novice 1 (準備級一級), pre-A1
  - L0-2nnn = Novice 2 (準備級二級), pre-A1
  - L1-nnnn = Beginner (入門級), CEFR A1
  - L2-nnnn = Basic (基礎級), CEFR A2
  - L3-nnnn = Intermediate (進階級), CEFR B1
  - L4-nnnn = Advanced (高階級), CEFR B2
  - L5-nnnn = Fluent (流利級), CEFR C1-C2
- Contains part-of-speech tags for all terms, but no definitions
- Source: 8000zhuyin_202307.zip / 8000zhuyin_202204.zip
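As a rough illustration of the ID scheme above, the following Python sketch splits a TOCFL ID into its level prefix and term index and looks up the corresponding CEFR band. The function name `parse_tocfl_id`, the sample IDs, and the assumption that everything after the level prefix is a plain numeric index are illustrative; they are not taken from the parser notebooks.

```python
import re

# Level prefix -> (English name, CEFR band), transcribed from the level list above.
TOCFL_LEVELS = {
    "L0-1": ("Novice 1", "pre-A1"),
    "L0-2": ("Novice 2", "pre-A1"),
    "L1": ("Beginner", "A1"),
    "L2": ("Basic", "A2"),
    "L3": ("Intermediate", "B1"),
    "L4": ("Advanced", "B2"),
    "L5": ("Fluent", "C1-C2"),
}

# Assumption: novice IDs look like "L0-1nnn" / "L0-2nnn" and the remaining
# levels like "Ln-nnnn", where the trailing digits are the term index.
_ID_RE = re.compile(r"^(L0-[12])(\d+)$|^(L[1-5])-(\d+)$")


def parse_tocfl_id(term_id):
    """Split an ID such as 'L3-0456' into level prefix, level name, CEFR band and index."""
    match = _ID_RE.match(term_id.strip())
    if match is None:
        raise ValueError(f"unrecognized TOCFL ID: {term_id!r}")
    prefix = match.group(1) or match.group(3)
    index = int(match.group(2) or match.group(4))
    name, cefr = TOCFL_LEVELS[prefix]
    return {"level": prefix, "name": name, "cefr": cefr, "index": index}


# Hypothetical IDs, used only to show the call.
print(parse_tocfl_id("L0-1023"))
print(parse_tocfl_id("L3-0456"))
```

The same approach should carry over to the CCCC and TBCL lists with a different prefix table, provided their IDs follow a comparable prefix-plus-index pattern.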
tocfl-20180419.csv: 2018 TOCFL wordlist version (html)
- 7945 terms (8106 with variants, 7399 unique), 7 levels
- Source: https://web.archive.org/web/20200227052851/http://www.sc-top.org.tw/download/8000zhuyin.zip
tocfl-20160215.csv ... tocfl-20170324.csv: older versions of TOCFL wordlists
top-20100915.csv, top-20111208.csv: 2010/2011 wordlist versions (html)
- Back then, the TOCFL test was simply called SC-TOP
- 4-5 levels (2010 file has a single level for CEFR A1-A2, L2 here)
- Terse English definitions for all terms -- useful for looking up the main intended sense or for deciphering some of the more obscure entries on newer lists, e.g.:

      只,zhǐ,M,"Individual measure word for box, watch or ring."
      不等,bùděng,Post,vary in number

  Here Post stands for postposition; such tags were changed to N in bulk on newer lists.
- Different part-of-speech tags, following PAVC textbooks.
- Source: 800+800020100915.xls, list20111208.xls
cccc.csv: CCCC 2022 wordlist (html)
- 1197 terms (1344 with variants, 1312 unique)
- 3 levels (reference):
  - L1-nnn = 萌芽級, pre-A1
  - L2-nnn = 成長級, CEFR A1
  - L3-nnn = 茁壯級, CEFR A2
- Part-of-speech and basic English definitions for all terms.
- Source: https://tocfl.edu.tw/assets/files/vocabulary/CCCC_Vocabulary_2022.xls
tbcl.csv: TBCL wordlist (html, with CC-CEDICT definitions)
- 14425 terms (14868 with variants, 14731 unique)
- 7 levels
- Definitions, part-of-speech and examples for ~1500 beginner level (L1-L3) words.
- Source: https://coct.naer.edu.tw/download/tech_report/
tocfl.ipynb, tbcl.ipynb: parser scripts
tbcl-affix.csv: TBCL affix table (suffixes and prefixes)
tbcl-chars.csv: TBCL character table by level (slightly different from characters used in the wordlist)
tbcl-grammar.csv: TBCL grammar points table
tocfl-cedict.csv, tbcl-cedict.csv: wordlists merged with CC-CEDICT definitions
errata.csv: corrections applied to pinyin in original lists to fix various errors
expanded/*.csv: wordlists with term variants expanded, each variant on a separate line
pleco/*.txt: wordlists for import to Pleco as flashcards or a user dictionary -- useful to get level tags for terms in Pleco
Columns in .csv files:
ID: term's level + index (row number in source Excel file)
Traditional: term in traditional characters
- May contain ()/, to indicate variants.
- See the Variants field or expanded/*.csv files to get clean hanzi.
Simplified: term converted to simplified characters
Pinyin: pinyin with diacritics, cleaned up and normalized. Tone changes are not indicated.
POS: part of speech, /-separated.
- See description on TOCFL website for details.
- Follows part of speech in Dangdai textbooks / TengPOS
- TOP wordlists follow older PAVC textbooks
Meaning (CCCC/TOP/TBCL wordlists): English definition
WFreq, SFreq (TBCL): writing and speech frequency (in Sinica corpus?)
MOE (TBCL): reference IDs for looking up terms on https://dict.concised.moe.edu.tw/dictView.jsp?ID=
Variants: for entries where TOCFL gives multiple variants of a term, an expanded, disambiguated list of the alternatives, stored as a JSON list of objects holding the column values of each alternative.
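To make the column layout more concrete, here is a minimal Python sketch that reads rows with `csv.DictReader`, splits the /-separated POS field, decodes the JSON in Variants, and builds a MOE dictionary lookup URL. The two sample rows, the key names inside the Variants objects, and the helper names `load_terms` and `moe_url` are assumptions made for illustration; the real files may differ in detail.

```python
import csv
import io
import json

# Hypothetical sample in the documented column layout (not copied from the real files).
SAMPLE = '''ID,Traditional,Simplified,Pinyin,POS,Variants
L1-0001,愛,爱,ài,Vst,
L1-0002,爸爸/爸,爸爸/爸,bàba/bà,N,"[{""Traditional"": ""爸爸"", ""Pinyin"": ""bàba""}, {""Traditional"": ""爸"", ""Pinyin"": ""bà""}]"
'''

# Base URL as given in the MOE column description.
MOE_BASE = "https://dict.concised.moe.edu.tw/dictView.jsp?ID="


def load_terms(fileobj):
    """Read wordlist rows, turning POS into a list and Variants into parsed JSON."""
    terms = []
    for row in csv.DictReader(fileobj):
        # POS is /-separated; an empty field becomes an empty list.
        row["POS"] = [p for p in (row.get("POS") or "").split("/") if p]
        # Variants is a JSON list of objects; assumed empty when there are no variants.
        raw = (row.get("Variants") or "").strip()
        row["Variants"] = json.loads(raw) if raw else []
        terms.append(row)
    return terms


def moe_url(ref_id):
    """Build a lookup URL from a single MOE reference ID (assumes one ID per call)."""
    return MOE_BASE + str(ref_id).strip()


terms = load_terms(io.StringIO(SAMPLE))
for term in terms:
    print(term["ID"], term["Traditional"], term["POS"], len(term["Variants"]))
```

For a real file, something like `open("tocfl-202307.csv", encoding="utf-8", newline="")` could be passed to `load_terms` instead of the in-memory sample; the UTF-8 encoding is an assumption here.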
- Browsable/linkable HTML versions: https://ivankra.github.io/tocfl/.
- TOCFL: https://tocfl.edu.tw/
- TBCL: https://coct.naer.edu.tw/TBCL/
- SC-TOP/TOCFL archive: https://web.archive.org/web/*/http://www.sc-top.org.tw/download/*