CJDClassification
Learning to automatically classify Chinese judgment documents according to the industry involved in their factual content.
Running order
extract_data.py --> extracts x (texts) and y (labels) from the original txt files.
sentences2words.py --> segments the sentences into words.
build_vocab.py --> builds the vocabularies.
data_loader.py --> converts the original texts to ids, e.g. 我爱** -> character ids: 9, 89, 344, 1244.
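The vocabulary-building and text-to-id steps above can be sketched as follows. This is a minimal illustration, not the repo's actual build_vocab.py/data_loader.py; the helper names, the unknown-character id 0, and the frequency cutoff are assumptions.

```python
from collections import Counter

def build_vocab(texts, max_size=5000):
    """Build a character-level vocabulary from raw texts.
    Id 0 is reserved for unknown characters (an assumed convention)."""
    counts = Counter(ch for text in texts for ch in text)
    chars = [c for c, _ in counts.most_common(max_size)]
    return {c: i + 1 for i, c in enumerate(chars)}

def text_to_ids(text, vocab):
    """Map each character to its id; unseen characters map to 0."""
    return [vocab.get(ch, 0) for ch in text]

vocab = build_vocab(["我爱北京", "我爱编程"])
ids = text_to_ids("我爱北京", vocab)
```

A real loader would also pad or truncate the id sequences to a fixed length before batching.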
Training and Test
train.py --> trains/tests character-level CNNs on CJDClassification/THUCnews/IMDb.
train_bert.py --> trains/tests CWSB-CNN on CJDClassification/THUCnews/IMDb.
(Note that this requires BERT for token-level and sentence-level encoding; see the project: fine-tuning BERT.)
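To make the character-level CNN concrete, here is a toy NumPy forward pass over a sequence of character ids: embedding lookup, 1D convolution, ReLU, global max-pooling, and a linear output layer. All shapes and hyperparameters are assumptions for illustration, not the architecture actually used in train.py.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_forward(ids, vocab_size=2000, emb_dim=16, num_filters=8,
                     kernel_size=3, num_classes=4):
    """Toy forward pass of a character-level text CNN (random weights)."""
    # Embedding lookup: (seq_len, emb_dim)
    emb_table = rng.standard_normal((vocab_size, emb_dim))
    x = emb_table[ids]
    # 1D convolution over character positions: (seq_len - k + 1, num_filters)
    w = rng.standard_normal((num_filters, kernel_size, emb_dim))
    seq_len = len(ids)
    conv = np.stack(
        [np.array([np.sum(x[i:i + kernel_size] * w[f])
                   for i in range(seq_len - kernel_size + 1)])
         for f in range(num_filters)],
        axis=1,
    )
    conv = np.maximum(conv, 0.0)   # ReLU
    pooled = conv.max(axis=0)      # global max-pool: (num_filters,)
    w_out = rng.standard_normal((num_filters, num_classes))
    return pooled @ w_out          # class logits: (num_classes,)

logits = char_cnn_forward([9, 89, 344, 1244, 17, 5])
```

In CWSB-CNN, the random embedding table would instead be replaced by BERT token-level and sentence-level encodings, as noted above.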
SubIMDb
(The file needs to be manually divided into train.txt, val.txt, and test.txt.)
SubTHUCNews
(The file needs to be manually divided into train.txt, val.txt, and test.txt.)
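The manual division mentioned above can be done with a small script like the one below. The 80/10/10 proportions, the fixed seed, and the function name are assumptions; adjust them to match your setup.

```python
import random

def split_dataset(lines, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle labeled lines and split them into train/val/test portions
    (a hypothetical 80/10/10 split by default)."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed for a reproducible split
    n = len(lines)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = lines[:n_test]
    val = lines[n_test:n_test + n_val]
    train = lines[n_test + n_val:]
    return train, val, test

samples = [f"label\ttext{i}" for i in range(100)]
train, val, test = split_dataset(samples)
# Each portion would then be written to train.txt, val.txt, and test.txt.
```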