bazingagin/npc_gzip

slice of ohsumed dataset?


Hey,

In the paper, the Ohsumed dataset has 3.4k training and 4k test observations.
From what I understand from Hugging Face (https://huggingface.co/datasets/ohsumed) and http://disi.unitn.it/moschitti/corpora.htm, the original dataset has far more observations.

Could you give more detail on how the dataset was obtained and where I can find it?

Sure. I used the data split from previous work: Yao, Liang, Chengsheng Mao, and Yuan Luo. "Graph convolutional networks for text classification." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.

You can download it here: https://github.com/yao8839836/text_gcn/tree/master/data/ohsumed_single_23
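
If you want to check the split sizes against the 3.4k / 4k figures from the paper after downloading, something like the sketch below might help. It assumes the folder is laid out as `<root>/<split>/<category>/<document file>`, which I have not verified here, so adjust it if the directory structure differs. The function names `count_documents` and `split_totals` and the local path in the usage comment are made up for illustration.

```python
from pathlib import Path


def count_documents(root):
    """Count files per category for each split directory under root.

    Assumes (unverified) a <root>/<split>/<category>/<file> layout.
    Returns a dict of the form {split_name: {category_name: n_files}}.
    """
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue  # skip stray files at the top level
        counts[split_dir.name] = {
            cat_dir.name: sum(1 for f in cat_dir.iterdir() if f.is_file())
            for cat_dir in sorted(split_dir.iterdir())
            if cat_dir.is_dir()
        }
    return counts


def split_totals(counts):
    """Collapse per-category counts into one document total per split."""
    return {split: sum(per_cat.values()) for split, per_cat in counts.items()}


# Hypothetical usage, with a local clone of the text_gcn repository:
# counts = count_documents("text_gcn/data/ohsumed_single_23")
# print(split_totals(counts))
```

The totals per split should be comparable to the train and test sizes quoted above if the layout assumption holds.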