Provide a mechanism for pre-processing HTML
Problem
When writing a ScrapedPage subclass you often need to write extra code to cope with quirks of the HTML you're scraping. For example, we often have to do some variation of URI.join(url, person[:image]) to make relative links and images absolute, and when scraping Wikipedia we do some processing on tables to make dealing with rowspan and colspan easier.
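As a concrete illustration of the fix-up every scraper currently repeats, here is the URI.join pattern in isolation (the url and person values are made up for the example):

```ruby
require 'uri'

# Hypothetical values: the URL of the page being scraped, and a scraped
# record whose :image attribute came out as a relative path.
url = 'http://example.com/people/index.html'
person = { image: '../images/photo.jpg' }

# The boilerplate this issue wants to eliminate: resolve the relative
# path against the page URL to get an absolute image URL.
absolute_image = URI.join(url, person[:image]).to_s
# => "http://example.com/images/photo.jpg"
```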
Proposed solution
Offer a way of defining transforms to be applied to the HTML before it gets passed to a ScrapedPage subclass. html-pipeline, GitHub's collection of HTML processing filters and utilities, looks like it might be suitable for this job. As a first pass we could add html-pipeline and have it resolve relative links by default.
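A minimal sketch of what such a transform might look like. This is not html-pipeline's actual API; it's a stdlib-only illustration (using a deliberately naive regex rather than a parsed document) of a callable transform that rewrites relative href/src attributes before the HTML reaches a ScrapedPage subclass:

```ruby
require 'uri'

# Hypothetical transform, in the spirit of an html-pipeline filter:
# rewrites relative href/src attribute values to absolute URLs.
# The regex approach is for illustration only; a real filter would
# operate on a parsed document.
class AbsoluteUrlTransform
  def initialize(base_url)
    @base_url = base_url
  end

  def call(html)
    html.gsub(/(href|src)="([^"]+)"/) do
      %(#{$1}="#{URI.join(@base_url, $2)}")
    end
  end
end

transform = AbsoluteUrlTransform.new('http://example.com/people/')
transform.call('<img src="photo.jpg">')
# => '<img src="http://example.com/people/photo.jpg">'
```

URI.join leaves already-absolute URLs untouched, so a transform like this can be applied unconditionally before the page is handed to the subclass.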
Acceptance criteria
When writing a scraper I shouldn't need to worry about whether link and image URLs are relative.
It would be good to move the HTML-specific bits of ScrapedPage into a ScrapedPage::HTML class (#17) before doing this ticket.