prevent download of duplicates?
mfiers opened this issue · 7 comments
Hey - first of all - thanks! - excellent script!
I was wondering if, as an enhancement, it would make sense to maintain a database of already-downloaded images, so that if you rerun a query (for example, to get more images for the same query, or to try a related query), the software does not re-download images, but either skips them or symlinks them (in the case of a different output folder)?
Hey Mark. That's a good idea.
Thinking out loud: maybe the user could specify a text file to store previously completed download links in. If they run the script again with that same text file, googliser ensures it doesn't download anything already listed in that file. We'd probably have to grab a larger range of search results at the beginning. Of course, eventually googliser will run out of images.
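The skip-already-seen idea could be sketched with a plain URL list and `grep`, something like this (an illustrative sketch with hypothetical file and function names, not googliser's actual code):

```shell
#!/bin/sh
# Sketch: skip any URL already recorded in a history file.
# HISTORY and maybe_download are hypothetical names for illustration.
HISTORY=downloaded.txt
touch "$HISTORY"

maybe_download() {
    url=$1
    # -F fixed string, -x whole-line match, -q quiet (exit status only)
    if grep -Fxq "$url" "$HISTORY"; then
        echo "skip: $url"
    else
        echo "get:  $url"            # a real script would download here
        echo "$url" >> "$HISTORY"    # record it for future runs
    fi
}

maybe_download "http://example.com/a.jpg"
maybe_download "http://example.com/a.jpg"   # second call is skipped
```

Because the history file only grows, the same file can be passed to every subsequent run of the same (or a related) search.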
Maybe we could use it to store failed downloads too?
Hi Mark, I've just pushed a commit that now allows the user to specify `--unique filename.txt`. This file will contain every download URL that completed OK or failed for the current search term. To use this file again, specify it again on your next search.
It looks to be working OK at my end. Can you please test it and advise?
Maybe this parameter should be called `--exclude` instead?
Hey Dan,
Thanks for the fast response!
I think it works, but I still have a slight problem. If I run `/googliser.sh -p tree -n 5 -N --unique tree.txt`, I get what I expect: 6 images, and 6 URLs in tree.txt. However, upon rerunning `/googliser.sh -p tree -n 5 -N --unique tree.txt`, all my old files in the tree folder are overwritten (though no duplicates are downloaded, it seems).
An alternative approach might be to continue counting? Or, add a (part of a) checksum to the file name?
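The checksum suggestion could look something like this (an illustrative sketch with a hypothetical helper, not googliser code; assumes `md5sum` is available, as on most Linux systems):

```shell
#!/bin/sh
# Sketch: rename a downloaded file to include a short content checksum,
# so a rerun producing a different image with the same name won't clobber it.
# save_with_checksum is a hypothetical helper name.
save_with_checksum() {
    src=$1
    sum=$(md5sum "$src" | cut -c1-8)   # first 8 hex chars of the MD5
    base=${src%.*}                     # filename without extension
    ext=${src##*.}                     # extension only
    mv "$src" "${base}-${sum}.${ext}"
    echo "${base}-${sum}.${ext}"
}

printf 'hello' > tree-1.jpg            # stand-in for a downloaded image
save_with_checksum tree-1.jpg          # -> tree-1-5d41402a.jpg
```

Identical re-downloads would then produce identical names (harmless overwrite), while genuinely different images keep distinct names.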
With regards to the name, I'm not sure. `--unique` seems OK to me. It does have partial overlap with `--save-links`, though.
thanks again!
Mark
The behaviour with regard to overwriting previous files is intended. But you can script around it and define an output path, so each search ends up in its own directory.
Yes, `--unique` does have a similar overlap with `--save-links`. `--unique` only saves a list of URLs where downloads were attempted, whereas `--save-links` saves every URL returned by Google in the search results. I think `--exclude` might be less confusing. ;)
I'm happy with either `--exclude` or `--unique`. And I agree, it is easy to script around the overwrite problem. So, all my problems are solved. Thanks.
I'll modify that parameter name now.
Cheers!