retorquere/zotero-storage-scanner

Identify duplicate pdfs attached to a citation in Zotero, so the duplicates can be moved to trash.

Closed this issue · 28 comments

Hi Emiliano,
So if your storage scanner, which just tags items as #duplicate-attachments, is only looking at the file extension, then that isn't enough for what I'm interested in (finding duplicate pdfs attached to a citation, so that we can delete the duplicates).
A way to do that might be to look only at pdfs, and check either their filename or file size. Some citations have 2 pdfs, but they are different pdfs, so we'd want to keep them.
Since the pdf files get renamed (from metadata), I think tagging citations with #duplicate-pdfs might be the solution. And don't compare the filename, just compare the pdf file size.
Roger
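
A minimal sketch of the size-only comparison suggested above, assuming the attachments live somewhere under a single storage directory. The function names `group_pdfs_by_size` and `largest_first` are hypothetical and not part of the plugin; an identical byte size only marks files as candidates, it does not prove they are the same document.

```python
from collections import defaultdict
from pathlib import Path


def group_pdfs_by_size(root):
    """Map byte size -> PDF paths under root, keeping only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        # Only PDFs are considered, so small snapshots are ignored.
        if path.is_file() and path.suffix.lower() == ".pdf":
            by_size[path.stat().st_size].append(path)
    return {size: sorted(paths) for size, paths in by_size.items() if len(paths) > 1}


def largest_first(groups):
    """Order candidate groups so the biggest duplicated files come first."""
    return sorted(groups.items(), key=lambda item: item[0], reverse=True)
```

Calling `largest_first(group_pdfs_by_size(storage_dir))` on a storage directory of your choice would list the largest same-size groups first, which is where the most space could be reclaimed.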

We have ~ 15,000 articles in this collection that is in zotero cloud storage.
As people add papers, many times there are duplicates, which I merge. This leads to multiple pdf copies being on a single merged citation.
So your Zotero Storage Scanner (made in 2018) is very useful for finding duplicate attachments.

I was wondering if you might be able to make it also identify when the duplication is of a pdf, since the snapshots are small.

What I would like to be able to do is find duplicate pdfs in a citation, in which the pdfs have the same file size, and know which are the biggest. That way I can delete these large pdf duplicates.

Our library is ~ 36 Gb in size, so removing the large pdfs that are duplicated would be very helpful.

Can you imagine this being a relatively straightforward update to release v5.0.8 of Zotero Storage Scanner?
Thanks, and thanks for all your work and progress on Better BibTeX for Zotero!
Roger

What OS are you and your students on?

KDE/Ubuntu 20.04. So running Ubuntu linux.

And are you looking for byte-for-byte copies, or text similarity?

Text similarity. I expect two duplicate pdfs to be of similar file size and similar content.

Scanning the files for content similarity is going to be performance intensive and I'm trying to find a solution that'd leave zotero usable while the scan is ongoing. Would an integration with rmlint be acceptable? Rmlint could do the scanning, and storage scanner could just tag attachments with the results.

Yes, rmlint doing the scanning would be fine. That sounds like a good solution

rmlint might actually find attachments on different zotero items. Just tagging them as duplicates won't help. Do you have any ideas on how this would be best handled?

Since each zotero item is in a folder, is it possible, if using rmlint, to restrict the linting to single folders?
So your algorithm to identify duplicate attachments would run as it does now (which finds the snapshots and also the pdfs that are duplicates).
If that was done first, then for each item so identified, checking those folders for duplicate pdfs would seem to do the job.
But it's a two-step approach.

Items can exist in multiple folders. And you'd still have no information on which attachments are duplicates of each other.

So say rmlint finds text similarity in pdfs. This can be two cases.

  1. 2 pdfs of 1 zotero entry,
  2. 2 pdfs but they are attached to 2 different zotero entries.

So 1 is just tell me the 2 pdfs, and I can delete one.
in 2. show me the 2 pdfs and the 2 zotero entries. And I can resolve.

So we would just have 2 versions: 1 is duplicate attachment, and 2 is duplicate citation and attachment.

And I expect that I would just run this on a single computer, to clean the collection, and that it doesn't need to be a generalized tool used by all users.

So 1 is just tell me the 2 pdfs, and I can delete one.
in 2. show me the 2 pdfs and the 2 zotero entries. And I can resolve.

Tell you how? Show you how? The scanner uses a single tag so far. That's not going to work here.

So case 2 is the case that I normally fix by using Zotero's "duplicates" categorizing, and I merge them.

So maybe we only need to do case 1, where 2 similar pdfs are in 1 zotero entry. And it would not say duplicate attachments if it's just the snapshot instead of a pdf.

That I can do.

Hi Emiliano,

In the interim, I've found DupeGuru ( https://dupeguru.voltaicideas.net/ ), which does a nice job of finding duplicate files in a directory tree on Ubuntu. And it can do it by filename, or by reading in the file etc.
So it can be my linter to find duplicates in my shared zotero library, under the /Zotero-lin/storage/ folder.

One simple question for you though. When I delete all the duplicate files from storage on my local Zotero client, there is some way that I am supposed to force my local copy to replace the one up in Zotero Storage.

So this group library is ~ 40Gb in size and has ~ 10Gb of duplicated pdfs in it. There are about 30 people actively using and syncing to the group library.

So when I remove 10Gb of files locally, that computer has a smaller storage, but my other computers never shrink.
And instead, over time, Zotero proceeds to sync all those files back down!

So I know there is some method to say "Now Sync my local up to Zotero Storage as the master copy". But I can't figure it out from the Zotero Forums.
Do you have an idea of how I should do this?
Thanks

Hi Emiliano, bump. Did you see the comment above about using dupeguru on ubuntu to remove duplicates?
But this way, the question is how do I force my local copy to update the files that are up in Zotero Storage, i.e. to get the duplicates deleted in Zotero Cloud Storage.

Right now I can shrink the library on 1 computer, but then it slowly grows back, by downloading what are the duplicates again.
Roger

Please don't bump comments. My time is spread fantastically thin, and I already get an avalanche of email. Additionally, if I did not see the previous email, it'd be unlikely I'd see the bump.

I've seen the comment, but from what I could see, dupeguru doesn't offer a programmatic way to drive it, just a GUI, so I have no way to interface with it. And it turns out that rmlint doesn't actually do text based fuzzy searches, so it could only find exact duplicates, not similar documents.

WRT the simple question above, the plugin would either tell zotero to delete the attachment, which would keep sync working, or tag the item and leave deletion to the user (which would also keep sync working). Just deleting the file from disk without telling zotero about it would indeed just have zotero fetch the file again.

Can you make dupeguru export a list of files you'd want to remove, or a list of the groups you'd want to prune?

Yes, dupeGuru, after scanning for duplicates (by filename, by file contents, by various similarity factors),
lets you save the results and export them to .csv or to .html.
So this could work.

The default way of running dupeGuru has the results show with 1 "reference" file, and then the duplicates.
So if a file is present as 6 copies under a certain top level directory, the 1st found copy is the "reference file" and then the next 5 listed are the duplicates.
In this form of the results, you wouldn't delete the 1st file, but would delete the next five duplicates.
(Attached: 2107dupeGuruResults.csv)

So in this example output

1,00-deJong-AppPhysL-ITOPEDOTStabilityOLED.pdf,/home/frenchrh/Zotero-lin/storage/ZPBDKEDK,367,100
1,00-deJong-AppPhysL-ITOPEDOTStabilityOLED.pdf,/home/frenchrh/Zotero-lin/storage/XGFEDS6X,367,100
1,00-deJong-AppPhysL-ITOPEDOTStabilityOLED.pdf,/home/frenchrh/Zotero-lin/storage/9GCZ7AF7,367,100
1,deJong-00-AppPhysL-ITOPEDOTStabilityOLED.pdf,/home/frenchrh/Zotero-lin/storage/RR9KEWX7,367,100
1,"De Jong et al_2000_Stability of the interface between indium-tin-oxide and poly (3,.pdf",/home/frenchrh/Zotero-lin/storage/3NUKN597,367,100
1,00-deJong-AppPhysL-ITOPEDOTStabilityOLED.pdf,/home/frenchrh/Zotero-lin/storage/IZRZZBMC,367,100
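
A rough sketch of how rows like the ones above could be split into a reference file and its duplicates. It assumes the column order seen in the sample (group ID, filename, folder, then two numeric columns), treats the first row of each group as the reference, and skips any row that does not start with a numeric group ID, such as a header. `split_groups` is a hypothetical helper, not part of dupeGuru or the plugin.

```python
import csv
import io
from pathlib import PurePosixPath


def split_groups(csv_text):
    """Return {group_id: (reference_path, [duplicate_paths])} from dupeGuru-style rows."""
    groups = {}
    # csv.reader handles quoted filenames that contain commas.
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3 or not row[0].strip().isdigit():
            continue  # skip blank lines and a possible header row
        group_id, filename, folder = row[0].strip(), row[1], row[2]
        groups.setdefault(group_id, []).append(PurePosixPath(folder) / filename)
    # First file listed in a group is kept as the reference; the rest are duplicates.
    return {gid: (paths[0], paths[1:]) for gid, paths in groups.items()}
```

Applied to the six rows above, this would give one group with the ZPBDKEDK copy as the reference and the other five files as duplicates.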

------------------- Below is the help system statement about the two exports. But "export_to_xhtml()" didn't work for me
export_to_csv()
Export current results to CSV.
The columns and their order in the resulting CSV file is determined in the same way as in export_to_xhtml().

export_to_xhtml()
Export current results to XHTML.
The configuration of the result_table (columns order and visibility) is used to determine how the data is presented in the export. In other words, the exported table in the resulting XHTML will look just like the results table.

I'd prefer csv.

The UI for all of this is a bit of a nightmare. Under the existing behavior I could just give all potential duplicates the same tag, as the context for duplicate detection is just the item they hang under. If we're looking for duplicates across items, this won't work. It would require UI work to guide the user through this, and I don't like UI work.

How about this: you use a tool like dupeguru to find duplicates and write those to a report of some sort. I provide an endpoint in Zotero that you could POST this report to with curl, and it would return the report but with zotero:// links, so you can find the attachments in Zotero by clicking on those, see which you want to keep, and delete the ones you don't in Zotero. That should get you what you want, and would not involve UI work in Zotero.

ssdeep might be better at finding similar files BTW.

Of the two options.

  1. "under the existing behavior I could just give all potential duplicates the same tag" if you did this and tagged the duplicates as [duplicate] or #duplicate# . I could then delete them in Zotero.
    This would be assuming that if there are 5 copies of a single file, such as a123.pdf, then you would only tag 4 of them, arbitrarily choosing the one that is the "reference" not duplicate file.
    This could work. And I would delete the attachments.

  2. For the second approach, I can post the report.csv file. But I'm not sure I understand the mechanics.
    You say "using curl", and am I uploading it to my local Zotero Client? Or to my zotero account on the web?
    And then it "returns the report with the zotero:// links" so this sounds like I would get a webpage report with links to my zotero group library online with Zotero, and choose which to delete.

I'm okay with either approach; #1 sounds easier for me to understand. I'm not sure I understand #2. And I have some 10 Gb of duplicates to remove.
Roger

1. "under the existing behavior I could just give all potential duplicates the same tag"   if you did this and tagged the duplicates as [duplicate]  or #duplicate# . I could then delete them in Zotero.

You wouldn't be able to tell what each of those is a duplicate of. If A is a duplicate of B and C, and D is a duplicate of E, all under potentially different items, and all would carry the tag duplicate, that doesn't give you much information to act on -- what if there are 50? 100?

This would be assuming that if there are 5 copies of a single file, such as a123.pdf, then you would only tag 4 of them, arbitrarily choosing the one that is the "reference" not duplicate file.

That doesn't seem like a great workflow to me. In the above scenario, I would tag A and C, but you'd have no information where B is. For those items, the attachment is effectively lost, not de-duplicated.

You say "using curl", and am I uploading it to my local Zotero Client?

Yep. You post the CSV, you get back an HTML report with zotero:// links for each group. You can click those links and the corresponding item will be selected in the Zotero client, not online -- this would work even if you're not connected to a network at all. You can then look at A, B and C and decide how you want to handle dedup.
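
The endpoint is only proposed in this thread, so the following is a client-side sketch of the same kind of report rather than the plugin's behavior. It assumes that the name of each storage folder is the attachment's item key and that `zotero://select/library/items/<key>` (or `zotero://select/groups/<id>/items/<key>` for a group library) selects that item in the Zotero client. `select_link` and `html_report` are hypothetical names.

```python
import csv
import html
import io
from pathlib import PurePosixPath


def select_link(folder, group_id=None):
    """Build a zotero://select link from a storage folder path (folder name = item key)."""
    key = PurePosixPath(folder).name
    library = f"groups/{group_id}" if group_id is not None else "library"
    return f"zotero://select/{library}/items/{key}"


def html_report(csv_text, group_id=None):
    """Render dupeGuru-style rows as an HTML list with one link per file."""
    lines, current = ["<ul>"], None
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3 or not row[0].strip().isdigit():
            continue  # skip blank lines and a possible header row
        dupe_group = row[0].strip()
        if dupe_group != current:
            # Start a new heading whenever the duplicate group changes.
            current = dupe_group
            lines.append(f"<li><b>Group {html.escape(current)}</b></li>")
        link = select_link(row[2], group_id)
        lines.append(f'<li><a href="{link}">{html.escape(row[1])}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)
```

Writing the returned string to an `.html` file and opening it in a browser should give clickable links, provided the `zotero://` protocol handler is registered on that machine.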

Ahh okay so let's go with the curl version. #2

Wait -- the plugin should already do what you want. If you remove the duplicates from disk, then run the scan, it should label all attachments for which you've just deleted the actual file with #broken_attachments. Just go to the root of your library, click that tag, select-all, delete.

Hi Emiliano, I just tested it out (it took some time because the collection is big),
but you're right.
So I first made sure I didn't have any #broken_attachments present or remaining.
Then DupeGuru found duplicate files in my zotero storage directory.
I deleted the duplicates.
Ran your storage scanner.
Found many attachment files tagged as #broken_attachments.
Went through and sent them to trash.
And am now syncing back up to Zotero Storage.

So I think this does work. Cool.

Roger
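
For the "deleted the duplicates" step, a cautious sketch of removing files from disk with a dry run first. `remove_files` is a hypothetical helper; the list of paths is assumed to come from the duplicate finder's report, with the reference copy of each group left out. After a real run, a Storage Scanner scan should tag the affected attachments with #broken_attachments as described above, so they can be trashed inside Zotero and sync stays consistent.

```python
from pathlib import Path


def remove_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report what would be removed."""
    removed = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # already gone, or not a regular file
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

Running it once with the default `dry_run=True` and reading the returned list before passing `dry_run=False` gives a chance to catch a wrong path before anything is deleted.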

Cool. A scan refreshes all the tags this plugin manages, so there's no need to manage these tags before a scan.