Duplicate rows found in the parent dataset
iosonopersia opened this issue · 0 comments
Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).
I found some duplicate rows (approximately 2,000 in each parquet partition file), meaning rows that have the same 'id' and the same 'citations' value. Because of how this project's workflow produces the data, such rows are identical in every column.
These duplicate rows should be removed from the next edition of the dataset.
As a suggestion, these lines of code could be used at some point during the workflow.
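For illustration, here is a minimal sketch of such a deduplication step. It assumes pandas and the 'id' and 'citations' column names mentioned above, with 'citations' holding hashable values such as strings. The function name `drop_duplicate_citations` and the sample data are hypothetical, and the project's own workflow may use a different library.

```python
import pandas as pd


def drop_duplicate_citations(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first row for each ('id', 'citations') pair.

    Assumes both columns contain hashable values (e.g. ints and strings).
    """
    deduplicated = df.drop_duplicates(subset=["id", "citations"], keep="first")
    # Renumber the index so it is contiguous after rows were dropped.
    return deduplicated.reset_index(drop=True)


# Tiny made-up sample: the second row repeats the first one.
sample = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "citations": ["{{cite web|a}}", "{{cite web|a}}", "{{cite web|b}}"],
    }
)

# Number of rows that repeat an earlier ('id', 'citations') pair.
duplicate_count = int(sample.duplicated(subset=["id", "citations"]).sum())

deduplicated = drop_duplicate_citations(sample)
```

On a real partition, the DataFrame could be loaded with `pd.read_parquet(...)` before calling the function; counting duplicates first makes it easy to check how many rows would be dropped.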