ostrolucky/Bulk-Bing-Image-downloader

Doesn't download more than 100 images!!

Closed this issue · 25 comments

Hi,

I'm facing some issues here: this script doesn't download more than 100 images. I think that's because no web driver is specified (like the chromedriver we use with Selenium). If you have an updated script, could you please upload it? I'm providing an input file, but the script also doesn't traverse the entire input file; at a random point it exits (without any error).

Thanks, man. You saved me a lot of time I would have spent writing a new script.

Can you share the command line you used so I can reproduce? Also, I'm not sure what you mean by web driver.

By web driver, I mean that when we use the Selenium library to perform scraping, we need to provide a chromedriver to it. But you're only using urllib here; maybe something like that could be provided for urllib as well?

python bbid.py -f ./birds_name_bbid_new -o ~/dataset/images_bbid/ --limit 950

Here is the file birds_name_bbid_new:
https://drive.google.com/open?id=1QnIZC--d4JoWEJzcHIoOX0uSbJ3zE6ns

It works for me. What OS do you use?

Ubuntu 18.04 LTS

Are you able to download more than 100 images?

Yep. There are 4202 images. Please check the exit code of the command (via echo $? after bbid finishes).
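For reference, here is a minimal sketch of checking the exit code programmatically instead of via echo $? in the shell. The wrapper below is hypothetical (it is not part of bbid.py), and the command line is the one from this thread; adjust the paths to your setup:

```python
import subprocess
import sys

# Run bbid.py with the same arguments as in the shell, then inspect
# its exit status: 0 means a clean exit, anything else signals failure.
cmd = [sys.executable, "bbid.py", "-f", "./birds_name_bbid_new",
       "-o", "images_bbid/", "--limit", "950"]
result = subprocess.run(cmd)
print("bbid exit code:", result.returncode)
```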

I just ran the script again. It isn't throwing any error, but the exit code is still 1!

Can you show me the output of ulimit -n? Afterwards, increase this limit via ulimit -n 1024 and see if it changes anything on the next command run.
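As a sketch, the same check can also be done from Python with the standard-library resource module (Unix only; the 1024 floor below is just the value suggested here, not something bbid.py requires):

```python
import resource

# Read the per-process open-file limit, the value `ulimit -n` reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

# Raise the soft limit toward 1024 if it is lower; the hard limit caps
# how far an unprivileged process can go.
target = 1024 if hard == resource.RLIM_INFINITY else min(1024, hard)
if soft < 1024:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```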

It's 1024.

Maybe you can also play with the --threads option.

Yes, but I don't have a very high-speed internet connection (although it's stable), so I think 20 is enough for me.
Which OS do you test this script on?

I mean reduce it, not increase it. I developed it on Ubuntu, but now I use macOS.

Sure, reducing the threads might help. I'll get back to you after the next run...

Tried it; still the same issue...

Have you also tried using only 1 or 2 threads? If that doesn't make a difference, perhaps use strace as a last resort to see what's going on.

Yes, I've tried 1 thread as well. Now I'm thinking of changing the script a bit to use Selenium and seeing how that works.

Selenium is very slow, but I guess in your case it would be better than stopping entirely.

Related #15

I would need some help from somebody who can troubleshoot this. I can't fix it if I can't reproduce it.

Sure, I'll help you troubleshoot. I think you'll be able to reproduce it if you use a large keyword file: run the script, and once it ends, check the output directory. It will neither download images for all the keywords in the input file nor create directories for all of them.

You can use the input file I uploaded earlier to try to reproduce it.

This weekend I'll try to debug it, see what's happening, and let you know. So far I haven't had the time to troubleshoot.

I actually did use the exact same file you posted here, with the same CLI arguments, and it downloads over 1000 images for me.

For all the keywords?

Indeed, no. Can you try whether increasing the timeout on this line

time.sleep(0.1)

e.g. to 3 helps? It seems to for me.
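The change amounts to waiting longer between Bing result-page fetches. A hedged sketch of the idea follows; the loop and function names below are hypothetical, not bbid.py's actual code, and only the sleep values come from the suggestion above:

```python
import time

# Was 0.1 in the script; a longer pause between result-page requests
# appears to keep Bing from cutting the result stream short.
FETCH_DELAY = 3

def fetch_all_pages(fetch_page):
    """Call fetch_page() until it returns None, pausing between
    requests so Bing's result pages are not hit too quickly."""
    pages = []
    while (page := fetch_page()) is not None:
        pages.append(page)
        time.sleep(FETCH_DELAY)
    return pages
```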

@ostrolucky the script stops downloading after nearly 500 images even though I set the limit to 2000. Do you know how I can download more images?
(base) mona@mona:~/research/Bulk-Bing-Image-downloader$ ./bbid.py -s 'cat' --limit 2000

You need to make sure there actually are more than 2000 unique images on Bing first. After that, you might want to experiment with changing the sleep line of code I posted; it's a Bing issue.