propublica/upton

Helper methods for scraping one page and for scraping multiple


That Scraper.new takes EITHER a URL and a selector OR an array of URLs is confusing. We should keep both forms on new for backwards compatibility, but add a helper method for each pattern -- and use those helper methods in the README.

This will hopefully allay some of the confusion in #30 and address the API problems that were mentioned in #5 without such a dramatic refactor.

Scraper.index will return a Scraper instance (with the actual fetching perhaps deferred until later) on which a #scrape call will fetch the links that the selector expression finds on the index page. Scraper.instances will return a Scraper instance on which a #scrape call will fetch the URLs given in the argument to .instances.
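To make the two proposed helpers concrete, here is a minimal sketch. It is not Upton's actual code: the class name SketchScraper, the link_finder parameter, and the example URLs are all hypothetical, and the helpers are assumed to be class-level methods that delegate to the existing overloaded constructor. The index lookup is injected as a lambda so the sketch runs without any network access.

```ruby
# Hypothetical sketch only; names and signatures are assumptions, not Upton's API.
class SketchScraper
  attr_reader :index_url, :selector, :instance_urls

  # Mirrors the overloaded constructor described in this issue:
  # either (index_url, selector) or (array_of_urls).
  def initialize(index_or_urls, selector = nil)
    if index_or_urls.is_a?(Array)
      @instance_urls = index_or_urls
    else
      @index_url = index_or_urls
      @selector = selector
    end
  end

  # Helper for the "index page plus selector" pattern.
  def self.index(index_url, selector)
    new(index_url, selector)
  end

  # Helper for the "explicit list of URLs" pattern.
  def self.instances(urls)
    new(Array(urls))
  end

  # Nothing is fetched until #scrape is called. `link_finder` stands in for
  # "fetch the index page and apply the selector"; the default finds no links.
  def scrape(link_finder: ->(_url, _selector) { [] })
    urls = @instance_urls || link_finder.call(@index_url, @selector)
    # Hand each URL to the caller's block if one was given, else return the URLs.
    urls.map { |url| block_given? ? yield(url) : url }
  end
end

# Usage: the two patterns now read differently at the call site.
by_list  = SketchScraper.instances(["http://example.com/a", "http://example.com/b"])
by_index = SketchScraper.index("http://example.com/", "a.story")
```

Because both helpers delegate to new, existing callers of the overloaded constructor keep working, which is the backwards-compatibility constraint raised above.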

I think for 1.0.0 the Scraper returned by "index" will immediately fetch the index page, so that the Scraper can be added to other scrapers (see #35). For now, the index page will still only be fetched on #scrape.

I changed my mind in the last 31 minutes.

For 0.4.0 the semantics of #initialize will change. The index page will be scraped immediately. However, the syntax will not change.

Hmm, if it makes requests on the first call (e.g. Scraper.new, Scraper.index), when are options set? I guess as a hash on that first call? That'll be a breaking change, so I'll queue that up for 1.0.0.
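The ordering problem can be sketched as follows. If the constructor itself fetches the index page, options set through setters afterwards arrive too late, so they have to be passed on that first call. Everything here is hypothetical: the class name EagerSketchScraper, the option keys sleep_time and verbose, their defaults, and the block used as a stand-in fetcher are illustrative assumptions, not the 1.0.0 design.

```ruby
# Hypothetical sketch only; option names and defaults are invented for illustration.
class EagerSketchScraper
  DEFAULTS = { sleep_time: 30, verbose: false }.freeze

  attr_reader :options, :links

  # Helper that forwards the options hash to the constructor.
  def self.index(index_url, selector, options = {}, &fetcher)
    new(index_url, selector, options, &fetcher)
  end

  # Options must be known here, because the index is fetched immediately.
  # The optional block stands in for the real HTTP fetch plus selector lookup.
  def initialize(index_url, selector, options = {}, &fetcher)
    @options = DEFAULTS.merge(options)
    fetcher ||= ->(_url, _selector, _opts) { [] }
    @links = fetcher.call(index_url, selector, @options)
  end
end

# Usage: options are supplied up front, before any request is made.
scraper = EagerSketchScraper.index("http://example.com/", "a.story", { verbose: true })
```

Code that previously built a scraper and then configured it through setters would need to move those settings into the hash, which is why this is a breaking change.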

Mostly implemented in future (1.0.0) at a25e84e

Partially implemented for 0.4.0 at 24cb65e