CRutkowski/Kijiji-Scraper

Scrape broke recently

Closed this issue · 6 comments

My hourly crontab started spitting errors today:

Traceback (most recent call last):
File "/home/mewse/Kijiji-Scraper/Kijiji-Scraper.py", line 250, in
main()
File "/home/mewse/Kijiji-Scraper/Kijiji-Scraper.py", line 247, in main
scrape(url_to_scrape, old_ad_dict, exclude_list, filename, skip_flag)
File "/home/mewse/Kijiji-Scraper/Kijiji-Scraper.py", line 171, in scrape
email_title = soup.find('div', {'class': 'message'}).find('strong').text.strip('"')
AttributeError: 'NoneType' object has no attribute 'find'

I fixed it by removing .strip() from the end of line 171:

170 if not email_title: # If the email title doesn't exist, pull it from the html data
171 #email_title = soup.find('div', {'class': 'message'}).find('strong').text.strip('"')
172 email_title = soup.find('div', {'class': 'message'}).find('strong').text
173 email_title = toUpper(email_title)

Crap, it's still broken

Hey, thanks for letting me know about the error.

I'll have a look tomorrow and see what's going on. From what I can tell from the error, the "soup" object is None and therefore soup.find() throws an error. Just need to figure out where soup is being set to None or not being set.
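
In the meantime, the kind of guard I have in mind would sit around the fetch; a minimal sketch, assuming the script uses requests to build "page" (the URL and the skip behaviour below are placeholders, not the script's actual code):

import requests
from bs4 import BeautifulSoup

url_to_scrape = "https://www.kijiji.ca/..."  # placeholder; the real URL comes from main()

page = requests.get(url_to_scrape)
if page.status_code != 200 or not page.content:
    # Bad or empty response; better to skip this pass than to parse nothing
    print("Fetch failed, skipping")
else:
    soup = BeautifulSoup(page.content, "html.parser")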

Ok, I've added a small check (lines 170 and 171 here) to skip that iteration of the while loop before the lines that are throwing the error:

168 soup = BeautifulSoup(page.content, "html.parser")
169
170 if not soup:
171 continue
172
173 if not email_title: # If the email title doesn't exist, pull it from the html data
174 email_title = soup.find('div', {'class': 'message'}).find('strong').text.strip('"')
175 email_title = toUpper(email_title)

Got the error again. I think it's the second call to find() on that line; I've added more error checking.
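
If it helps narrow it down, splitting the chained calls apart shows which lookup is coming back as None. A quick sketch against a made-up page (not the real Kijiji markup):

from bs4 import BeautifulSoup

# Made-up page resembling the uncookied layout: no <div class="message"> at all
html = "<html><body><div id='content'><strong>Alert title</strong></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

message_div = soup.find('div', {'class': 'message'})
print(message_div)  # None -> the first lookup fails, and the AttributeError
                    # comes from calling .find('strong') on that None
if message_div is not None:
    print(message_div.find('strong'))  # only safe once we know it's a Tag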

I think I know what's going on. If I'm cookied with Kijiji, the "Get an alert with the newest ads" text is inside <div id="message">, but uncookied (incognito mode) it's in a <div id="content">, and there are multiple content divs.
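
A sketch of one way to cover both layouts, where the helper name and the exact attributes are guesses until I check the live pages:

from bs4 import BeautifulSoup

def find_alert_title(soup):
    # Cookied layout: the alert text sits in a <strong> inside the message div
    # (the script matches it by class; the div may also carry id="message")
    message_div = soup.find('div', {'class': 'message'}) or soup.find('div', id='message')
    if message_div is not None:
        strong_tag = message_div.find('strong')
        if strong_tag is not None:
            return strong_tag.text.strip('"')
    # Uncookied layout: several content divs; use the first one that has a <strong>
    for content_div in soup.find_all('div', id='content'):
        strong_tag = content_div.find('strong')
        if strong_tag is not None:
            return strong_tag.text.strip('"')
    return None

Something along those lines could replace the chained lookup and just hand back None when neither layout matches.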

I have merged a fix from @bpjobin for this issue