
CyberCan is a lexicon of contemporary Cantonese built from more than 100 million internet texts posted on discussion forums in Hong Kong.

CyberCan

Text mining has become a dominant approach to extracting useful information from massive amounts of unstructured online data. However, existing tools for Chinese word segmentation are not well suited to processing Cantonese social media text. This project developed CyberCan, a lexicon of contemporary Cantonese based on more than 100 million internet texts. Details on how the lexicon was created can be found here: https://osf.io/preprints/socarxiv/tyjr7
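To illustrate how a lexicon can drive word segmentation, here is a minimal sketch in Python using forward maximum matching, a generic dictionary-based technique. It is not necessarily the method used in the CyberCan paper. The function name `segment` and the entries in `toy_lexicon` are hypothetical and chosen only for illustration; they are not taken from the CyberCan files, whose format is not described here.

```python
def segment(text, lexicon, max_len=None):
    """Split text with forward maximum matching.

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        # Longest entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in lexicon), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy entries for illustration only; NOT taken from CyberCan.
toy_lexicon = {"今日", "好", "開心", "唔該"}

result = segment("今日好開心", toy_lexicon)
print(result)
```

In practice, a lexicon like this would more likely be supplied as a custom dictionary to an existing segmentation tool; the sketch only shows why having the right entries in the dictionary affects where word boundaries are placed.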

Citation: Shen, F., Yu, W., Min, C., Ye, Q., Xia, C., Wang, T., & Wu, Y. (n.d.). CyberCan: A New Dictionary for Cantonese Social Media Text Segmentation. Retrieved from https://osf.io/preprints/socarxiv/tyjr7