google-research-datasets/coarse-discourse

Some posts in the provided dataset are duplicates

Closed this issue · 1 comment

It seems that the dataset contains duplicate posts: 319 distinct post ids occur 530 times in total. Some of the copies also differ from one another. Some have different majority links, even though the paper says "we asked annotator to only annotate one relation to the closest comment in terms of thread distance that they were responding to", and some have different main_type values even though the annotations come from the same annotators.
Here are the duplicate post ids:
https://drive.google.com/open?id=0BzFTHaxcNfjfOG5HQkNWRUhHQTQ

It also seems that in the provided dataset (duplicate posts included), 1242 posts have only one annotation and 3681 posts have only two annotations.
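For anyone who wants to run this kind of check themselves, here is a minimal sketch of how the counts could be computed. It assumes the threads are already loaded as a list of dicts, each with a `posts` list whose entries carry `id`, `majority_link`, and `annotations` fields; those field names are my assumption about the layout, and `summarize_posts` is a hypothetical helper, not part of the released scripts. The toy data is made up for illustration only.

```python
from collections import Counter, defaultdict


def summarize_posts(threads):
    """Summarize duplicate post ids and annotation counts.

    Assumes each thread is a dict with a "posts" list, and each post has
    "id", "majority_link", and "annotations" keys (assumed field names).
    """
    posts = [post for thread in threads for post in thread.get("posts", [])]

    # How many times each post id appears across all threads.
    id_counts = Counter(post["id"] for post in posts)
    duplicates = {pid: n for pid, n in id_counts.items() if n > 1}

    # Duplicated ids whose copies disagree on the majority link.
    links_by_id = defaultdict(set)
    for post in posts:
        links_by_id[post["id"]].add(post.get("majority_link"))
    conflicting = {pid for pid in duplicates if len(links_by_id[pid]) > 1}

    # Number of posts (duplicates included) per annotation count.
    annotation_counts = Counter(len(post.get("annotations", [])) for post in posts)
    return duplicates, conflicting, annotation_counts


# Toy data, invented for illustration; not taken from the real dataset.
toy_threads = [
    {"posts": [
        {"id": "t3_x", "annotations": [{}, {}, {}]},
        {"id": "t1_a", "majority_link": "t3_x", "annotations": [{}]},
    ]},
    {"posts": [
        {"id": "t1_a", "majority_link": "t1_b", "annotations": [{}, {}]},
        {"id": "t1_b", "majority_link": "t3_x", "annotations": [{}]},
    ]},
]

duplicates, conflicting, annotation_counts = summarize_posts(toy_threads)
print(len(duplicates), sum(duplicates.values()))  # distinct duplicated ids, total occurrences
print(sorted(conflicting))
print(annotation_counts[1], annotation_counts[2])
```

If the dataset file stores one JSON object per line, the `threads` list could be built by calling `json.loads` on each line before passing it to the helper.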

Hi @felicitywang, I am not able to run the provided Python script. I need to build a discourse classifier. Can you share the dataset?