propublica/upton

Downloading and Caching part

Closed this issue · 7 comments

kgrz commented

I was trying to separate the downloading and caching code from the main upton.rb file. The separation would lead to easier testing and further expansion. Can we have that code extracted as a separate gem ( I was surprised to see that a gem with this functionality existed out there. Or may be it did and I didn't know). That would mean that the Upton code would depend on this external gem.

Would love to hear views for and against this.

So are you thinking a gem that sits between the client and the web, caching all requests transparently?

kgrz commented

Yes.

Kashyap KMBC

Seems like a good idea.

The only reason I can think of NOT to do that is that there may need to be a tight coupling of some higher-level Upton stuff (e.g. varying how a page is fetched based on whether it's an instance or an index, or by the index of the instance page being fetched). This can probably be accomodated with just an options hash passed to the fetch method of the new gem, so it's just something to keep in mind.

I'm not sure this would be my first priority, but I think it's a good idea. If you want to do it and submit a pull request, I'm happy to help, otherwise I'll get to it eventually.

I think separating out the get_page and fetch_page methods is all that would need to be done, along with including some of the instance-wide variables like @debug and @verbose.

kgrz commented

The only reason I can think of NOT to do that is that there may need to be a tight coupling of some higher-level Upton stuff (e.g. varying how a page is fetched based on whether it's an instance or an index, or by the index of the instance page being fetched). This can probably be accommodated with just an options hash passed to the fetch method of the new gem, so it's just something to keep in mind.

Yeah, I had that in mind. But I assure you this is different. Take a look at this for an idea: https://github.com/kgrz/cachify

The library does only two things:

  1. Makes the HTTP call to the website we specify
  2. Caches it for future usage.

Actually, the only reason I have to extract this functionality as a gem was just that there were no existing libraries for a fairly regularly sought-after functionality. :) Initially, I had all that logic inside Upton itself.

I think separating out the get_page and fetch_page methods is all that would need to be done, along with including some of the instance-wide variables like @debug and @verbose

Yes, that is what exactly I did. The @debug option (which is basically an alias for caching), is included in the library. I will add the verbose and logging parts to it next. The get_page is equivalent to Cachify::Download.(..opts..).get and it internally manages how and where the data is fetched.

And yes, I will work on this and won't waste your time :D

Awesome, awesome! Super cool. You're totally right that it makes sense to pull this stuff out of Upton. Looking forward to seeing what comes of this.

Some of the settings in get_page around accept headers and encodings might come in handy to solve various issues...

kgrz commented

Okay, I will keep an eye on those :)

Closing this issue, @kgrz, since this was solved, as I understand it, in 2764452