teracow/googliser

zero results (again)

teracow opened this issue · 8 comments

Yes, Google have updated their page-code again, so some new regexes are needed to scrape the links.

Working on it now ...
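
For the record, the basic approach hasn't changed: fetch a results page and pattern-match the image URLs out of whatever markup comes back. A rough sketch of that, assuming direct image URLs still appear somewhere in the page source (the query and the regex below are illustrative only; finding the real pattern in the new markup is the actual job):

    # illustrative only: fetch one page of image-search results and pull out
    # anything that looks like a direct image URL, de-duplicated
    curl -s -A 'Mozilla/5.0' 'https://www.google.com/search?q=kittens&tbm=isch' |
        grep -oE 'https?://[^"&]+\.(jpg|jpeg|png|gif)' |
        sort -u > image-urls.txt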

Threw a quick scraper together that seems to work (haven't pushed it up to here yet).

But it's only finding a maximum of 104 unique images across 10 pages. Hmm ... have to keep looking. Unfortunately, I'm out of time now, so I'll keep looking tomorrow.

Google have certainly advanced their page-code. It gets harder each time to extract the original image URLs. 😆

OK we're out of action for now. I'll need to decode the endless-page scripting in order to request more than a single page of image results.

I'm not in a coding-cycle at the moment, and I'm unable to say when I'll be able to get around to this. Hopefully, it'll be the next time I have a few days free. 😞

If anyone would like to have a shot at fixing this, you're more than welcome. 😁

The current issue is: I can scrape the new results page, but can't trigger the endless-page scrolling. So, if I separately request 10 pages of results, I actually get the first page 10 times over (with the same 100-or-so results listed on that first page).

I've pushed the new scraper to GitHub, so at least results from the first page can be found.

Now need to work out how to request the rest of the results pages (again).
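
For reference, requesting later pages used to be just a matter of varying the pagination parameters in the URL, roughly like this (a sketch only; 'ijn' and 'start' are the parameters the old page-code honoured, and the new endless-scroll page seems to hand back the same first batch regardless of what they're set to):

    # sketch of the old per-page requests: 'ijn' selects the results page and
    # 'start' the result offset; with the new page-code every one of these
    # requests appears to return the same first batch of results
    for page in $(seq 0 9); do
        curl -s -A 'Mozilla/5.0' \
            "https://www.google.com/search?q=kittens&tbm=isch&ijn=${page}&start=$((page * 100))" \
            >> results.html
    done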

Your scraper is the fastest I've found, thanks!
Compared to iCrawler and google-images-download, which are also struggling with the Google code change, you have at least made it work for one page (approx. 40 images)!

What I suggest as a temporary workaround is to add the parameters below to your parameter list, like this:

--adjusted-period-min [PRESET] 
--adjusted-period-max [PRESET] 

The idea is that this should allow downloading images for multiple specified periods, and thus requesting multiple pages for each class. If I do this 10 times for each class, I will have 400 images per class, which is currently enough for me. Do it 20 times and you'll have your 800 again.

Unfortunately, I don't have the skills to implement the above suggestion, otherwise I would have contributed more instead of only suggesting what to do. :) I hope the idea helps to solve the issue soon, though.

> The idea is that this should allow downloading images for multiple specified periods, and thus requesting multiple pages for each class. If I do this 10 times for each class, I will have 400 images per class, which is currently enough for me. Do it 20 times and you'll have your 800 again.

That's an interesting idea. 🤓

But I'm not sure what you mean by specified periods. Do you mean the Google search parameter called 'time'?

No, I didn't mean the time parameter; you already offer this, I guess. I was hoping there was a similar custom-period functionality as in the text search, but there isn't, unfortunately.

However, the workaround can be quite simple. Just add a year (e.g. 2011) to the search phrase, and with a bit of luck the page only returns images matching your search phrase from that year. This needs a bit more testing, but the first checks seem promising.
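
If that holds up, it could be wrapped around the existing script with a simple loop, something like this (untested; the --phrase and --number option names are my guess, so check the script's help output for the real ones):

    # hypothetical wrapper: one run per year appended to the phrase, so each
    # run gets its own first page of results ('--phrase' and '--number' are
    # assumed option names; adjust to whatever googliser.sh actually accepts)
    for year in $(seq 2011 2020); do
        ./googliser.sh --phrase "kittens ${year}" --number 40
    done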

Another Google setting that is interesting is the 'Search settings' option under Settings. There you can specify the number of search results per page. Maybe this setting can help to get more than 40 images per run.

Cheers!

Okiedoke, some good thoughts there.

I'll see if I can spend some time on it this weekend.