TUMCC is the first Chinese corpus in the field of jargon identification.
While building TUMCC, we collected a total of 28,749 sentences (804,971 characters) from 19,821 Telegram users in 12 Telegram groups.
We completed data screening and word segmentation before releasing the corpus, so it should be convenient to use.
After cleaning, TUMCC contains 3,863 sentences (a total of 100,000 characters) from 3,139 Telegram users.
TUMCC-clean.txt
contains the corpus after our cleaning. You can use it directly in your research.
TUMCC-raw.7z
contains the raw information we collected from Telegram. You can clean the text yourself to obtain more valid data.
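
As a starting point for working with TUMCC-clean.txt, the following Python sketch loads the file and counts token frequencies. It rests on assumptions this README does not confirm: that the file is UTF-8 encoded, holds one sentence per line, and separates the segmented words with whitespace. The function names load_segmented_corpus and token_frequencies are hypothetical helpers, not part of the release; adjust the parsing if the actual layout differs.

```python
from collections import Counter
from pathlib import Path


def load_segmented_corpus(path, encoding="utf-8"):
    """Return a list of token lists, one per non-empty line.

    Assumption: one sentence per line, tokens separated by whitespace.
    """
    sentences = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences


def token_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts


if __name__ == "__main__":
    corpus_path = Path("TUMCC-clean.txt")
    if corpus_path.exists():
        corpus = load_segmented_corpus(corpus_path)
        frequencies = token_frequencies(corpus)
        print("sentences:", len(corpus))
        print("distinct tokens:", len(frequencies))
        print("most common:", frequencies.most_common(10))
    else:
        print("TUMCC-clean.txt not found in the current directory")
```

Comparing the printed sentence count with the figure reported above is a quick check that the assumed layout matches the file.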
For more details about the Telegram groups targeted for data extraction, please refer to the paper "Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features"
(Information Processing and Management, 2022).
All Rights Reserved. Please cite us if you use the dataset for research purposes.