cambridgeltl/mop

Confusion about the downstream task code

johnsongwx opened this issue · 6 comments

Hi! This is great work, and we are trying to follow it.
While reading the code for the downstream tasks, e.g. BioASQ 7b, we could not find the code for preprocessing the original dataset. It seems that you use *.tsv files produced by splitting the original dataset, while the original is *.json. Did we miss the splitting code in the project, or do we need to implement it ourselves?
Also, in the original BioASQ 7b dataset, some questions have multiple articles. Should we concatenate all of the articles into a single text to be paired with the question?
Looking forward to your reply. Thanks!

Hello! Thank you for following our work. For the BioASQ 7b dataset, you can find the preprocessing code via this link. The splits of all these datasets are the same as in previous work. Here we only consider the binary classification task, as this is the standard setting in the BLURB benchmark. Details about this dataset can be found in the original paper. Hope this helps.

Thank you so much for helping me!
While running experiments I came across another question. In the S20Rel/SFull KG, there are some repeated triples. As I understand it, each line (each triple) in the file train2id.txt represents 'head entity id, tail entity id, relationship id', so there shouldn't be repeated lines in the file. However, there are many such cases in the file. Have I misunderstood something?


Thank you for pointing this out. We have checked the original data source, and there are indeed some repeated triples in the UMLS knowledge graph, which we had not noticed previously.

We have now uploaded our triple generation code for creating the "entity2id.txt" and "relation2id.txt" files, and you can generate your own triples by filtering out these repeated triples. A new, cleaner version of the two datasets will be uploaded soon.
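For readers who want to filter the repeated triples themselves, here is a minimal sketch of one way to do it. It is not the authors' triple generation code. It assumes each triple line holds three whitespace-separated ids in the order described above (head, tail, relation), and that the file may start with a single line holding the triple count; the function names `dedupe_triples` and `dedupe_file` and the `write_count_header` flag are hypothetical.

```python
from pathlib import Path


def dedupe_triples(lines):
    """Drop repeated triples, keeping the first occurrence of each.

    Returns (unique_triples, n_removed). Lines that do not have exactly
    three whitespace-separated fields (blank lines, or a leading count
    line if the file has one) are skipped.
    """
    seen = set()
    unique = []
    n_removed = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        triple = tuple(fields)  # (head id, tail id, relation id)
        if triple in seen:
            n_removed += 1
        else:
            seen.add(triple)
            unique.append(triple)
    return unique, n_removed


def dedupe_file(src, dst, write_count_header=True):
    """Read triples from src and write the deduplicated set to dst.

    write_count_header is an assumption: set it to False if your
    train2id.txt does not start with a line holding the triple count.
    """
    unique, n_removed = dedupe_triples(Path(src).read_text().splitlines())
    out = [" ".join(t) for t in unique]
    if write_count_header:
        out.insert(0, str(len(unique)))
    Path(dst).write_text("\n".join(out) + "\n")
    return n_removed
```

Calling `dedupe_file("train2id.txt", "train2id.dedup.txt")` would write the filtered file and return the number of lines dropped. Writing to a new path keeps the original file available for comparison.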

Hi, may I ask whether the cleaner version you mentioned earlier, i.e. the deduplicated knowledge graph, has been released yet?

Are the ids here different from the entity ids in UMLS? It looks like they were remapped during generation.

@johnsongwx Sorry, I forgot to reply.

"Hi, may I ask whether the cleaner version you mentioned earlier, i.e. the deduplicated knowledge graph, has been released yet?"

I have not generated a new one so far, but you can use the triple generation code to generate whatever version you want.

"Are the ids here different from the entity ids in UMLS? It looks like they were remapped during generation."

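I used the CUIs from UMLS, but I did not keep the IDs in the generated KG. You can extract this information yourself when doing the triple generation. For more on CUIs, you can refer to this link.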
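As a rough illustration of keeping the CUI information during triple generation, the sketch below assigns integer ids to CUIs and relation names in first-seen order and returns the mapping, so each id can be traced back to its CUI. It is an assumption about how such a remapping could be done, not the repository's actual script. The function name `build_id_maps` and the input shape (an iterable of `(head_cui, relation, tail_cui)` tuples) are hypothetical, and the CUI-like strings in the usage note are made-up placeholders.

```python
def build_id_maps(cui_triples):
    """Map CUIs and relation names to integer ids in first-seen order.

    cui_triples: iterable of (head_cui, relation, tail_cui) strings.
    Returns (entity2id, relation2id, id_triples), where id_triples holds
    (head id, tail id, relation id) tuples in the column order discussed
    earlier in this thread.
    """
    entity2id = {}
    relation2id = {}
    id_triples = []
    for head, relation, tail in cui_triples:
        # setdefault keeps the existing id if the key was seen before
        h = entity2id.setdefault(head, len(entity2id))
        t = entity2id.setdefault(tail, len(entity2id))
        r = relation2id.setdefault(relation, len(relation2id))
        id_triples.append((h, t, r))
    return entity2id, relation2id, id_triples


def invert(mapping):
    """Return an id -> name dictionary for looking up CUIs by id."""
    return {idx: name for name, idx in mapping.items()}
```

For example, `entity2id, relation2id, ids = build_id_maps([("C0000001", "rel_a", "C0000002")])` followed by `invert(entity2id)[0]` recovers the placeholder string `"C0000001"`. Saving `entity2id` alongside the generated triples is what preserves the link back to UMLS.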