teracow/googliser

prevent download of duplicates?

mfiers opened this issue · 7 comments

Hey - first of all - thanks! - excellent script!

I was wondering if, as an enhancement, it would make sense to maintain a database of already-downloaded images. Then, if you rerun a query (for example, to get more images for the same term, or to try a related query), the software would not re-download those images, but would either skip them or symlink them (in the case of a different output folder)?

Hey Mark. That's a good idea.

Thinking out loud: maybe the user could specify a text file in which to store previously completed download links. If they run the script again, they specify that same text file, and googliser makes sure it doesn't download anything already listed in that file. We'd probably have to grab a larger range of search results at the beginning. Of course, eventually googliser will run out of images.

Maybe we could use it to store failed downloads too?
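Roughly, the filtering might look something like this (just a sketch; the file names, variables, and the download function here are placeholders, not anything googliser actually uses):

    history_file=history.txt          # URLs attempted in previous runs
    touch "$history_file"

    while read -r url; do
        # skip anything already attempted in an earlier run
        grep -qFx "$url" "$history_file" && continue

        # "download" stands in for whatever actually fetches the image
        download "$url"

        # record the URL whether it completed OK or failed
        echo "$url" >> "$history_file"
    done < search_results.txt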

Hi Mark, I've just pushed a commit that now allows the user to specify --unique filename.txt. This file will contain every download URL that completed OK or failed for the current search term. To use this file again, specify it again on your next search.
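Usage over two runs would look something like this:

    # first run: downloads images and records every attempted URL in tree.txt
    ./googliser.sh -p tree -n 5 -N --unique tree.txt

    # second run with the same file: URLs already listed in tree.txt are skipped
    ./googliser.sh -p tree -n 5 -N --unique tree.txt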

It looks to be working OK at my end. Can you please test it and advise?

Maybe this parameter should be called --exclude instead?

Hey Dan,

Thanks for the fast response!

I think it works, but I still have a slight problem. If I run ./googliser.sh -p tree -n 5 -N --unique tree.txt, I get what I expect: 6 images, and 6 URLs in tree.txt. However, on rerunning ./googliser.sh -p tree -n 5 -N --unique tree.txt, all my old files in the tree folder are overwritten (though it seems no duplicates are downloaded).

An alternative approach might be to keep counting upwards from the previous run? Or to add (part of) a checksum to the file name?
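For example, something roughly like this (just illustrating the idea; the base filename here is made up):

    # compute a short checksum of the downloaded file and fold it into the
    # name, so a later run can never clobber an existing image
    # ("google-image.jpg" is only an example name)
    sum=$(md5sum "google-image.jpg" | cut -c 1-8)
    mv "google-image.jpg" "google-image-$sum.jpg"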

Regarding the name, I'm not sure. --unique seems OK to me. It does overlap partially with --save-links, though.

thanks again!
Mark

The behaviour with regard to overwriting previous files is intended. But you can script around it and define an output path so each search ends up in its own directory.
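A wrapper along these lines would do it (I'm assuming the output-path option is spelled --output here; check ./googliser.sh --help for the exact name):

    # run each search into its own directory, with its own exclusion file
    # (--output is an assumption; substitute the actual output-path option)
    for phrase in tree forest river; do
        ./googliser.sh -p "$phrase" -n 5 -N --unique "$phrase.txt" --output "results/$phrase"
    done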

Yes, --unique does overlap somewhat with --save-links. --unique only saves the list of URLs where downloads were attempted, whereas --save-links saves every URL returned by Google in the search results.

I think --exclude might be less confusing. ;)

I'm happy with either --exclude or --unique. And I agree, it is easy to script around the overwrite problem. So - all my problems are solved. Thanks.

I'll modify that parameter name now.

Cheers!