ARBML/masader

[Error] Duplicated dataset with missing download link

Opened this issue · 3 comments

Describe the dataset error

Hi,

I was checking datasets on the great Masader site and found that two datasets are exact duplicates. Unfortunately, the download link on the provided site is also unavailable. I am mainly interested in discussing ideas for automatically detecting duplicated entries on Masader. Thanks for taking the time to read my suggestion and review this issue!

Additional context

Thank you @AMR-KELEG for the report. I removed the duplicate, and the site should be updated soon. In the past, we ran an embedding-based deduplication pass, which removed many of the duplicates. Let me know if you have other ideas. All the metadata is accessible on Hugging Face at https://huggingface.co/datasets/arbml/masader.
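
For illustration only, here is a minimal sketch of similarity-based duplicate detection. It is not the code that was used for Masader: it swaps learned embeddings for simple character trigram count vectors so that it runs with the standard library alone. The function names, the 0.9 threshold, and the sample entry names are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Turn a string into a bag of character n-grams (a crude stand-in for an embedding)."""
    # Lowercase and collapse whitespace so trivial formatting differences are ignored.
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(entries, threshold=0.9):
    """Return (i, j, score) for every pair of entries at or above the threshold."""
    vectors = [char_ngrams(entry) for entry in entries]
    pairs = []
    for i, j in combinations(range(len(entries)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Hypothetical entry names, not taken from the actual catalogue.
    names = [
        "Arabic Tweets Sentiment Corpus",
        "arabic tweets  sentiment corpus",
        "Quran Speech Recognition Dataset",
    ]
    for i, j, score in find_duplicates(names):
        print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} ({score:.2f})")
```

The same pairwise comparison would apply to real sentence embeddings: only `char_ngrams` and `cosine` would need to change. The threshold is a judgment call, and flagged pairs would still need a manual check before anything is removed.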

Thanks @zaidalyafeai
The current method you use sounds reasonable, and I do not think I have a better idea.
On another note, do you think we could have a way to report datasets that are no longer accessible?

The status of datasets changes a lot, so it is difficult to keep track. We have a report feature that can be used for this.