Chinese_Skewed_TxtClf
Chinese text classification datasets and the machine-learning-based classifiers applied to them, as described in the paper:
Yuen-Hsien Tseng, "The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification," Journal of Educational Media & Library Sciences, Vol. 57, No. 1 (March 2020).
The datasets are as follows (see the article cited above for details):
- WebDes
- News
- CTC
- CnonC
Classifiers:
- Naive Bayes (NB)
- Support Vector Machine (SVM)
- Random Forest (RF)
- Single hidden-layer neural network (NN)
- Convolutional Neural Networks (CNN)
- Recurrent Convolutional Neural Networks (RCNN)
- Facebook's fastText
- Bidirectional Encoder Representations from Transformers (BERT)
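Because the datasets are skew-distributed, a single accuracy-like score can hide poor performance on rare classes. The following is a minimal sketch, not code from this repository, of how micro- and macro-averaged F1 differ on a skewed toy example. The function name `f1_scores` and the toy labels are illustrative assumptions, and the repository's own notebooks may compute their metrics differently.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions.

    Hypothetical helper for illustration; not part of this repository.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy skewed data: 9 documents of class "A", 1 of class "B",
# and a classifier that always predicts the majority class.
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the micro-averaged score stays high while the macro-averaged score drops sharply, because the minority class is never predicted.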
1. Description of Files:
- Datasets: datasets mentioned above.
- BERT_txtclf: a folder for running the BERT classifier.
- BERT_txtclf_HowTo.docx: a document describing how to run the BERT classifier on the datasets.
- TxtClfer.ipynb: a self-explanatory Jupyter Notebook for NB, SVM, NN, CNN, and RCNN. You can export it to TxtClfer.py (for example, with jupyter nbconvert --to script TxtClfer.ipynb) to run it from the command line.
- fastText_run_log.txt: a combined guide and log file showing how to run the fastText classifier on the datasets.
- ft_metrics.sh: batch execution file to run fastText.
- ft_metrics.py: code required by the above batch execution file.
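fastText's supervised mode reads one example per line, with each label marked by the `__label__` prefix and followed by whitespace-separated tokens. The sketch below shows one way to produce such lines. The helper name `to_fasttext_line` is hypothetical, and the example assumes the Chinese text has already been segmented into space-separated tokens. The actual layout of the files in Datasets and the preprocessing used by ft_metrics.sh may differ, so consult fastText_run_log.txt for the real procedure.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one (label, text) pair as a fastText supervised training line.

    Hypothetical helper for illustration; not part of this repository.
    Assumes `text` is already segmented into whitespace-separated tokens.
    """
    # A label must be a single token, so join any internal whitespace.
    label = "_".join(str(label).split())
    # Collapse newlines and repeated spaces so each example is one line.
    text = " ".join(text.split())
    return f"{prefix}{label} {text}"


line = to_fasttext_line("財經 新聞", "股市  今日\n上漲")
print(line)
```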
Note: To run the BERT classifier in BERT_txtclf, you must first download the files it imports (or simply all files) from https://github.com/google-research/bert into the BERT_txtclf folder.
2. To cite these datasets, source code, or experimental results:
Yuen-Hsien Tseng, "The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification," Journal of Educational Media & Library Sciences, Vol. 57, No. 1 (March 2020).