Harshdeep1996/cite-classifications-wiki

Duplicate rows found in the parent dataset

iosonopersia opened this issue · 0 comments

Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).

I found some duplicated rows (approximately 2,000 in each parquet partition file), meaning rows that share the same 'id' and the same 'citations' value. As a consequence of this project's workflow, such rows are identical in every column.

These duplicated rows should be removed from the next release of the dataset.
As a suggestion, these lines of code could be used at some point during the workflow.
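
For illustration, here is a minimal sketch of such a deduplication step using pandas. It is not the project's actual code: the function name `drop_duplicate_citations` is hypothetical, and it assumes that one partition fits in memory and that the 'id' and 'citations' columns hold hashable values such as strings.

```python
import pandas as pd


def drop_duplicate_citations(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row for each ('id', 'citations') pair.

    Hypothetical helper: assumes the column names reported in this issue
    and that both columns contain hashable values (e.g. strings).
    """
    deduplicated = df.drop_duplicates(subset=["id", "citations"], keep="first")
    # Renumber the rows so the index has no gaps after the removal.
    return deduplicated.reset_index(drop=True)
```

It could be applied to each partition after loading it with `pd.read_parquet` and before writing the result back. If the workflow runs on Spark instead, the equivalent call would presumably be `DataFrame.dropDuplicates(['id', 'citations'])`.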