Provide a mechanism for pre-processing html

Question

Closed this issue 8 years ago · 1 comment

Problem

When writing a ScrapedPage subclass, you often need extra code to cope with quirks of the HTML you're scraping. For example, we often have to do some variation of URI.join(url, person[:image]) to make relative links and image URLs absolute, and when scraping Wikipedia we pre-process tables to make rowspan and colspan easier to deal with.
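To make the repeated workaround concrete, here is a minimal sketch of the URI.join pattern using only Ruby's standard library. The page URL and the person hash are made-up values for illustration, not taken from any real scraper.

```ruby
require "uri"

# Hypothetical page URL and scraped field values, for illustration only.
page_url = "https://example.org/members/index.html"
person = {
  image: "../images/jane.jpg",           # relative to the page's directory
  source: "/people/jane",                # relative to the site root
  avatar: "https://cdn.example.com/a.png" # already absolute
}

# URI.join resolves a reference against the page URL and returns a URI object.
# An already-absolute URL passes through unchanged.
image_url  = URI.join(page_url, person[:image]).to_s
source_url = URI.join(page_url, person[:source]).to_s
avatar_url = URI.join(page_url, person[:avatar]).to_s
```

Every field that holds a link needs its own call like this, which is the duplication this issue is trying to remove.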

Proposed solution

Offer a way of defining transforms to be applied to the HTML before it gets passed to a ScrapedPage subclass. html-pipeline, GitHub's HTML processing filters and utilities, looks like it might be suitable for this job.

As a first pass, we could add html-pipeline and have it resolve relative links by default.
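One possible shape for such a mechanism is sketched below: a list of transforms, each taking the HTML and the page URL, applied in order before the page class sees the markup. This is an assumption-laden illustration, not html-pipeline's API or anything ScrapedPage currently provides. The names absolute_links and preprocess are hypothetical, and the regex stands in for a real HTML parser purely to keep the sketch dependency-free.

```ruby
require "uri"

# Hypothetical transform: rewrites href/src attribute values to absolute URLs.
# A regex keeps this sketch stdlib-only; a real filter should use an HTML parser.
absolute_links = lambda do |html, base_url|
  html.gsub(/\b(href|src)=(["'])(.*?)\2/) do
    attr, quote, value = Regexp.last_match.captures
    absolute = begin
      URI.join(base_url, value).to_s
    rescue URI::Error
      value # leave values that URI cannot parse untouched
    end
    "#{attr}=#{quote}#{absolute}#{quote}"
  end
end

# Hypothetical helper: applies each transform in order,
# feeding the output of one into the next.
def preprocess(html, base_url, transforms)
  transforms.reduce(html) { |current, transform| transform.call(current, base_url) }
end

html  = '<a href="/people/jane"><img src="images/jane.jpg"></a>'
clean = preprocess(html, "https://example.org/members/", [absolute_links])
```

Because each transform has the same call signature, a Wikipedia-specific table fix could be appended to the list without touching the link handling.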

Acceptance criteria

When writing a scraper, I shouldn't need to worry about whether link and image URLs are relative.

Answer 1 · 2016-11-15T19:08:30.000Z

Would be good to move the HTML-specific bits of ScrapedPage into a ScrapedPage::HTML class (#17) before doing this ticket.