Our datasets follow previous works: for long texts we follow ConWea, and for short texts we follow LOTClass. We transform all of their data into a unified JSON format.
Download datasets from: https://drive.google.com/drive/folders/1D8E9T-vuBE-YdAd9OBy-yS4UW4AptA58?usp=sharing
Long-text datasets (following ConWea):
- 20Newsgroup Fine (20NF)
- 20Newsgroup Coarse (20NC)
- NYT Fine (NYT_25)
- NYT Coarse (NYT_5)
Short-text datasets (following LOTClass):
- AGNews
- DBpedia
- IMDB
- Amazon
Unzip the data into './data/processed'.
Another way to obtain the data (not recommended):
You can download the long-text data from ConWea and the short-text data from LOTClass, then transform them into JSON format using our code. The code is located at 'preprocess_data/process_long.py' (and 'process_short.py' for short texts). You need to edit the preprocessing code to change the dataset path to your download path and to change the task name; the processed data is then written to 'data/processed'. We also provide preprocessing code for X-Class: 'process_x_class.py'. Example invocations are sketched below.
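For example, after editing the paths and task name inside the scripts, they can presumably be run directly as plain Python scripts (a sketch; check each script for details):

python preprocess_data/process_long.py
python preprocess_data/process_short.py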
This project is based on python==3.8. The dependencies are as follows (an example install command is given after the list):
- PyTorch
- DGL
- yacs
- visdom
- transformers
- scikit-learn
- numpy
- scipy
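For instance, the dependencies can typically be installed via pip (a sketch; we do not pin exact versions, and the PyTorch/DGL builds should match your CUDA setup):

pip install torch dgl yacs visdom transformers scikit-learn numpy scipy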
- We recommend starting visdom to show the results:
visdom -p 8888
Open a browser at server_ip:8888 to view the visdom panel.
- Train:
First edit 'task/pipeline.py' to specify the config file and the CUDA devices to use (a sketch of this edit is given below). Some configuration files are provided in the 'config' folder.
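A minimal sketch of that edit, assuming the settings are plain Python variables (the names here are hypothetical; check the actual file for the real ones):

```python
# Hypothetical sketch of the two edits in task/pipeline.py;
# the real variable names in the repo may differ.
import os

# restrict training to the GPUs you want to use (the code expects multiple GPUs)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# point the pipeline at one of the provided config files (hypothetical name)
config_file = "config/nyt_coarse.yaml"
```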
Start training:
python task/pipeline.py
Our code is designed for multiple GPUs and may currently be unable to run on a single GPU.
Provide your datasets in the directory 'data/processed':
- 'keywords.json': the keywords for each class. Type: dict. Key: class_index. Value: a list containing all keywords for this class. See the provided datasets for details (and the toy sketch below).
- 'unlabeled.json': the unlabeled sentences used in our paper. Type: list. Item: a list with 2 items ([sentence_i, label_i]). To facilitate evaluation we follow ConWea's setting, where the labels of the sentences are provided; these labels are only used for evaluation.
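To make the expected formats concrete, here is a toy sketch in Python (the directory name, classes, keywords, and sentences are all invented; only the shapes follow the description above):

```python
# Toy illustration of the expected file formats; all contents are invented.
import json
import os

task_dir = "data/processed/my_task"  # hypothetical dataset directory name
os.makedirs(task_dir, exist_ok=True)

# keywords.json: dict mapping class_index -> list of keywords for that class
keywords = {
    "0": ["basketball", "coach", "playoffs"],
    "1": ["senate", "election", "congress"],
}
with open(os.path.join(task_dir, "keywords.json"), "w") as f:
    json.dump(keywords, f)

# unlabeled.json: list of [sentence_i, label_i];
# the labels are used only for evaluation
unlabeled = [
    ["The team clinched the playoffs with a late three-pointer.", 0],
    ["The senate is expected to vote on the bill next week.", 1],
]
with open(os.path.join(task_dir, "unlabeled.json"), "w") as f:
    json.dump(unlabeled, f)
```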
Provide a config file in the 'config' directory. You can copy one of the existing config files and change some fields, such as number_classes, classifier.type, data_dir_name, etc.
Then specify the config file name in 'pipeline.py' and run the pipeline code.
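As an illustration, the fields above could be set through yacs roughly like this (a sketch assuming yacs's standard CfgNode API; the default values and the file name are invented, so copy a provided config rather than writing one from scratch):

```python
# Sketch of a yacs-style config; only number_classes, classifier.type and
# data_dir_name are named in this README -- the values here are invented.
from yacs.config import CfgNode as CN

cfg = CN()
cfg.number_classes = 5
cfg.data_dir_name = "my_task"
cfg.classifier = CN()
cfg.classifier.type = "bert"  # assumed value; see the provided configs

# override the defaults from a copied-and-edited config file
cfg.merge_from_file("config/my_task.yaml")  # hypothetical file name
```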
Please cite the following paper if you find our code helpful! Thank you very much. Paper link: https://aclanthology.org/2021.emnlp-main.222/
@inproceedings{zhang-etal-2021-weakly,
    title = "Weakly-supervised Text Classification Based on Keyword Graph",
    author = "Zhang, Lu and Ding, Jiandong and Xu, Yi and Liu, Yingyao and Zhou, Shuigeng",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.222",
    pages = "2803--2813",
    abstract = "Weakly-supervised text classification has received much attention in recent years for it can alleviate the heavy burden of annotating massive data. Among them, keyword-driven methods are the mainstream where user-provided keywords are exploited to generate pseudo-labels for unlabeled texts. However, existing methods treat keywords independently, thus ignore the correlation among them, which should be useful if properly exploited. In this paper, we propose a novel framework called ClassKG to explore keyword-keyword correlation on keyword graph by GNN. Our framework is an iterative process. In each iteration, we first construct a keyword graph, so the task of assigning pseudo labels is transformed to annotating keyword subgraphs. To improve the annotation quality, we introduce a self-supervised task to pretrain a subgraph annotator, and then finetune it. With the pseudo labels generated by the subgraph annotator, we then train a text classifier to classify the unlabeled texts. Finally, we re-extract keywords from the classified texts. Extensive experiments on both long-text and short-text datasets show that our method substantially outperforms the existing ones.",
}