internetarchive/fatcat

duplicates w/ fulltext


These seem to be three identical files with identical metadata:
[attached screenshot: 3-dupes-plastic-factory]

Thanks for the catch, and for filing an issue! These are all the same version of the same paper and should be merged into a single entity. Even if they were different versions, they would still need to be grouped under the same "work" entity.

Here are the three release entities and the search query:

Some more background and details:

What happened in this particular case is that I crawled a number of "long-tail" open access journals and inserted about 1.5 million release entities from that crawl without matching against an identifier (like a DOI), because most of these works don't have DOIs or other identifiers. Here's what Semantic Scholar and Google Scholar know about this paper (note: no identifier):

In this case, I crawled three near-identical PDFs and created a new release entity for each, so there are three copies.

I wasn't aware of this category of problem from this import, but I am aware of two related problems with the long-tail import: we don't have linked "container" (journal) metadata for these 1.5 million papers, and many of the papers are actually from larger OA publishers (e.g., PLOS) but got mixed in with smaller publishers on the repository domains that got crawled. Here's an example of the latter category of error:

There are a few solutions to these categories of problems:

  • releases will be auto-grouped into works based on metadata (title, authors, year). This is primarily to group pre-prints with their published versions, but it will also group these near-duplicates, partially resolving duplicate entries
  • future creation of release entities lacking a persistent identifier (e.g., DOI) will be much more conservative. For works with identifiers, we can do a fast lookup to see if something with the same ID already exists; for works without identifiers, we need to do a fuzzy match to see if something very similar already exists and should be merged. biblio-glutton is the tool we'll use for this fuzzy matching (see the sketch after this list)
  • targeted cleanups of the earlier long-tail import of 1.5 million releases are needed; at a minimum, container metadata should be added. I've been working on this over the past couple of weeks but haven't come up with a robust solution yet
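To make the lookup-before-insert idea for identifier-less works concrete, here is a minimal Python sketch. The normalized grouping key and in-memory index are hypothetical simplifications on my part; the real importers will lean on biblio-glutton for fuzzy matching rather than anything this crude.

```python
# Sketch (not fatcat's actual importer code): collapse a few metadata fields
# into a normalized key, and only create a new release when no existing
# release already shares that key.
import re
import unicodedata

def grouping_key(title: str, first_author: str, year: int) -> str:
    """Build a crude normalized key from title, first author, and year."""
    def norm(s: str) -> str:
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return re.sub(r"[^a-z0-9]+", "", s.lower())
    return f"{norm(title)}:{norm(first_author)}:{year}"

existing_releases = {}  # key -> release ident (stand-in for a real lookup index)

def insert_or_merge(title: str, first_author: str, year: int, new_ident: str) -> str:
    key = grouping_key(title, first_author, year)
    if key in existing_releases:
        # near-duplicate found: group/merge with the existing release
        return existing_releases[key]
    existing_releases[key] = new_ident
    return new_ident
```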

For this specific case of three duplicates, I merged the entities in https://fatcat.wiki/editgroup/shf64rgvgreqbm4dqekjx5d4cq

Thanks for the detailed explanation and links! That really helps me visualize how changes propagate. (I still need to figure out grouping other than redirects.) If the PDFs had been completely identical, would the duplication still have happened?

If the PDFs are bit-for-bit identical (usually checked via SHA-1), these failure modes shouldn't happen: the import scripts do a lookup before inserting.
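For concreteness, that exact-duplicate check might look roughly like the sketch below. The hashing part is standard library; treat the lookup URL and query parameter as my assumptions about the public fatcat API, not a reference.

```python
# Sketch of a hash-based "lookup before insert": compute the file's SHA-1 and
# ask the (assumed) fatcat file-lookup endpoint whether a matching file
# entity already exists.
import hashlib
import requests

def sha1_of_file(path: str) -> str:
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def file_already_known(path: str) -> bool:
    resp = requests.get(
        "https://api.fatcat.wiki/v0/file/lookup",  # assumed endpoint
        params={"sha1": sha1_of_file(path)},
    )
    return resp.status_code == 200  # a 404 would mean no existing file entity
```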

As a fine-print detail, there are something like 20 duplicate file entities (duplicates of the same file) that slipped through due to a race condition during early bulk imports; I haven't cleaned these up (merged the entities) yet.