LCS2-IIITD/MOMENTA

Duplicated records in Harm-P dataset

Closed this issue · 1 comment

Hi authors, thanks for releasing the dataset. I am keen on using the Harm-C and Harm-P datasets for my own research experiments. However, I found that the Harm-P dataset contains many duplicate records across the training, validation and test splits. To be precise, there are 1489, 5 and 155 duplicates in the training, validation and test splits, respectively (see the attached screenshots).

Could you please check the dataset available through the GitHub download link and provide a corrected version? Thank you in advance!

Harm-P's Train Set

Harm-P's Val Set

Harm-P's Test Set
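
For anyone who wants to re-check the counts reported above, a minimal sketch of one way to count duplicates in a split is shown below. It assumes each split is stored as a JSON Lines file and that a record is identified by its `image` and `text` fields; the field names and the `load_jsonl` helper are illustrative assumptions, not taken from the repository, so adjust them to the actual file layout.

```python
import json
from collections import Counter


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumed file format)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def count_duplicates(records, key_fields=("image", "text")):
    """Count records that repeat the key fields of another record.

    key_fields is an assumption about what identifies a record;
    a key that occurs n times contributes n - 1 duplicates.
    """
    counts = Counter(tuple(rec.get(f) for f in key_fields) for rec in records)
    return sum(n - 1 for n in counts.values())


if __name__ == "__main__":
    # Tiny in-memory example; replace with load_jsonl("<path to split>").
    sample = [
        {"image": "a.png", "text": "x"},
        {"image": "a.png", "text": "x"},
        {"image": "b.png", "text": "y"},
    ]
    print(count_duplicates(sample))  # prints 1
```

Counting on a tuple of fields rather than on the whole record means that rows differing only in an ID column are still treated as duplicates, which may or may not match how the numbers above were obtained.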

Hi @mingshanhee, thanks for the heads-up. The files are now updated. Please see the updated README and repo for further details.