scraperwiki/code-scraper-in-browser-tool

Issue with bs4

Closed this issue · 4 comments

I'm using scraper-in-browser to write a scraper for a plain, 100-row table for a workshop, and BeautifulSoup() from bs4 isn't soupifying the entire page. The old version of BeautifulSoup soupifies the same page properly in scraper-in-browser.

Examples:

https://gist.github.com/danhillreports/6152491

from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

Is it this bug?

http://stackoverflow.com/questions/11650700/beautifulsoup-does-not-work-for-some-web-sites/11651200#11651200

If so, pass "html.parser" as the second argument on the line that makes the soup:

 soup = BeautifulSoup(html.content, "html.parser")
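A minimal sketch of that workaround, assuming the page is a plain table like the one described above. The `html` string here is a made-up stand-in for the fetched page (the line above passes `html.content` from a real response instead), and `row_count` is only an illustrative name. When no parser is named, bs4 picks the "best" one installed, which is lxml if it is available; naming "html.parser" forces Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: a plain 100-row table.
rows = "".join("<tr><td>row %d</td></tr>" % i for i in range(100))
html = "<html><body><table>%s</table></body></html>" % rows

# Naming the parser explicitly stops bs4 from choosing lxml on its own.
soup = BeautifulSoup(html, "html.parser")

# Count the rows that made it into the soup.
row_count = len(soup.find_all("tr"))
print(row_count)
```

Comparing the row count against the number of rows you expect is a quick way to tell whether the whole page was parsed.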

Also, if so, it has affected a couple of people, so I need to look at which versions of Python/bs4/lxml we use... Help finding a bug report in either lxml or bs4 would be really useful!
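One way to gather those versions is sketched below. `report_versions` is a hypothetical helper name, not part of any library; it assumes that `bs4.__version__`, `lxml.etree.LXML_VERSION` and `lxml.etree.LIBXML_VERSION` are the attributes of interest, and it reports a package as "not installed" rather than failing when the import is missing.

```python
import importlib
import sys


def report_versions():
    """Collect Python, bs4, lxml and libxml2 versions as strings."""
    versions = {"python": sys.version.split()[0]}

    try:
        bs4 = importlib.import_module("bs4")
        versions["bs4"] = getattr(bs4, "__version__", "unknown")
    except ImportError:
        versions["bs4"] = "not installed"

    try:
        etree = importlib.import_module("lxml.etree")
        # Both attributes are tuples of integers, e.g. (2, 9, 1).
        versions["lxml"] = ".".join(str(n) for n in etree.LXML_VERSION)
        versions["libxml2"] = ".".join(str(n) for n in etree.LIBXML_VERSION)
    except ImportError:
        versions["lxml"] = "not installed"
        versions["libxml2"] = "not installed"

    return versions


versions = report_versions()
for name in sorted(versions):
    print("%s: %s" % (name, versions[name]))
```

The libxml2 version is included because lxml delegates HTML parsing to it, so a report to either project would probably need it.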

Looks like that's it! Thanks, I didn't run into that article before opening the issue.


Leaving this open as it's affected two people now. If anyone can find the upstream bugs that'd be great!

Don't think this is an issue any more.