samwize/python-email-crawler

TypeError: expected string or buffer

Opened this issue · 13 comments

Sometimes it runs, sometimes it doesn't.

[14:22:38] INFO::email_crawler - Crawling http://www.google.com.au/search?q=electrician&start=0
[14:22:39] ERROR::email_crawler - Exception at url: http://www.google.com.au/search?q=electrician&start=0
HTTP Error 503: Service Unavailable
[14:22:39] ERROR::email_crawler - EXCEPTION: expected string or buffer 

+1 Same here!

+1 Same here. Could you please suggest a fix for this? Thank you

+1 Same problem

python email_crawler.py "intext:gmail filetype:csv"
[10:14:12] INFO::email_crawler - ----------------------------------------
[10:14:12] INFO::email_crawler - Keywords to Google for: intext:gmail filetype:csv
[10:14:12] INFO::email_crawler - ----------------------------------------
[10:14:12] INFO::email_crawler - Crawling http://www.google.com/search?q=intext%3Agmail+filetype%3Acsv&start=0
[10:14:14] INFO::email_crawler - Crawling http://www.google.com/search?q=intext%3Agmail+filetype%3Acsv&start=10
...
[10:14:59] ERROR::email_crawler - Exception at url: http://www.google.com/search?q=intext%3Agmail+filetype%3Acsv&start=390
HTTP Error 503: Service Unavailable
[10:14:59] ERROR::email_crawler - EXCEPTION: expected string or buffer 
Traceback (most recent call last):
  File "email_crawler.py", line 212, in <module> 
    crawl(arg)
  File "email_crawler.py", line 65, in crawl
    for url in google_url_regex.findall(data):
TypeError: expected string or buffer
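
For context, the TypeError is a follow-on failure: the HTTP 503 makes the page download return nothing, so the regex on line 65 receives None instead of a string. A minimal guard, sketched in Python 2 to match the traceback (fetch(), crawl_one() and the regex pattern are assumptions, not the project's actual code):

import re
import urllib2

# Assumed pattern -- the project's real google_url_regex may differ.
google_url_regex = re.compile(r'/url\?q=(http[^&]+)&')

def fetch(url):
    """Download a Google results page; return None on an HTTP error
    such as the 503 in the log above."""
    try:
        return urllib2.urlopen(url).read()
    except urllib2.HTTPError as e:
        print 'HTTP Error %d at %s' % (e.code, url)
        return None

def crawl_one(url):
    data = fetch(url)
    if data is None:
        # Without this guard, findall(None) raises
        # "TypeError: expected string or buffer" on Python 2.
        return []
    return google_url_regex.findall(data)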

same problem

This issue should be resolved by merge #7

Issue still not resolved; same here with the latest version cloned from git on my Linux machine

mrkkr commented

I still have a problem with "TypeError: expected string or buffer" . Can anyone help?

Have the same issue as well

Here is a solution to your problem:

  1. Open the file email_crawler.py
    (If you are using the terminal, use nano email_crawler.py to edit the file)
  2. Go to line 24, which says MAX_SEARCH_RESULTS = 500, and change it to MAX_SEARCH_RESULTS = 100

Note that the reason behind this is that the script crawls 500 pages of Google results, so Google treats the requests as spam from a script scraping its search engine and blocks them accordingly.
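
For reference, the fix is just that one constant near the top of email_crawler.py. The pause between result pages sketched below is not part of the original script, only a suggested way to look less like automated scraping:

import random
import time
import urllib

MAX_SEARCH_RESULTS = 100   # was 500; fewer result pages looks less spam-like
RESULTS_PER_PAGE = 10      # Google pages through results with start=0, 10, 20, ...

def result_page_urls(keywords):
    """Yield the Google result-page URLs the crawler would visit, pausing
    between them.  Sketch only; the real crawl loop lives in crawl()."""
    query = urllib.urlencode({'q': keywords})
    for start in range(0, MAX_SEARCH_RESULTS, RESULTS_PER_PAGE):
        yield 'http://www.google.com/search?%s&start=%d' % (query, start)
        time.sleep(random.uniform(2.0, 5.0))   # space requests out to dodge the 503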

I've got it too, and what @kevingatera suggested didn't work.
It happens before it even gets the second page done, so it's not the script being blocked.
The exact error I get is:

:~/python-email-crawler$ python email_crawler.py "ios developers"
[19:05:06] INFO::email_crawler - ----------------------------------------
[19:05:06] INFO::email_crawler - Keywords to Google for: ios developers
[19:05:06] INFO::email_crawler - ----------------------------------------
[19:05:06] INFO::email_crawler - Crawling http://www.google.com/search?q=ios+developers&start=0
[19:05:06] ERROR::email_crawler - Exception at url: http://www.google.com/search?q=ios+developers&start=0
HTTP Error 503: Service Unavailable
[19:05:06] ERROR::email_crawler - EXCEPTION: expected string or buffer
Traceback (most recent call last):
  File "email_crawler.py", line 212, in <module>
    crawl(arg)
  File "email_crawler.py", line 65, in crawl
    for url in google_url_regex.findall(data):
TypeError: expected string or buffer

@charlieporth1 What's happening is that Google blocks your IP almost as soon as they get your request. Using another computer/IP will work.

@kevingatera turns out I was using torify and that didn't help. You should include IP rotation similar to what's in here. I would help you if I knew more about Python.
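
For anyone who wants to try the IP-rotation idea, here is a rough sketch using urllib2's ProxyHandler. The proxy addresses are placeholders and none of this exists in the project today:

import random
import urllib2

# Placeholder proxies -- replace with HTTP proxies you actually control
# or are permitted to use.
PROXIES = [
    '203.0.113.10:8080',
    '203.0.113.11:8080',
    '203.0.113.12:3128',
]

def open_via_random_proxy(url):
    """Fetch a URL through a randomly chosen proxy so successive Google
    requests do not all originate from one IP.  Sketch only; the real
    crawler would still need its own headers and error handling."""
    proxy = random.choice(PROXIES)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener.open(url, timeout=30).read()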