datatogether/archivertools

How do we add "pattern requests" for scraping functionality?


I was just looking at edgi-govdata-archiving/archivers-harvesting-tools#8 as I wind down/clean up EDGI repos, and I think this would be wonderful. How would this type of ask work with the new and improved archivertools?

Tools for harvesting ArcGIS REST endpoints?
Do we have a procedure for pulling down whole ESRI REST endpoints? E.g., https://map11.epa.gov/arcgis/rest/services.
OpenAddresses uses this useful scraper to get individual datasets from mapservers, imageservers, and featureservers: https://github.com/openaddresses/esri-dump
cc @louh

@dcwalk I'm not able to see the linked issue, it's giving me a 404. Is there text beyond the quote that I'm not seeing?

I think for this specific example, each URL would be a separate entity, and the majority of the child pages should be crawlable via the crawler because they're just HTML/JSON/XML. However, there are a couple of services that wouldn't be crawlable, such as this application which generates KML files from map layers, which we'd need to address specifically.
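
To make that concrete, here's a rough sketch (not archivertools code, just illustrative Python) of how a crawler could enumerate those child pages. ArcGIS REST directories return JSON when you append `f=json`, listing `folders` and `services` at each level; the EPA root above is used as an example.

```python
# Minimal sketch: walk an ArcGIS REST services directory and list every
# service it exposes. Illustrative only, not part of archivertools.
import requests

ROOT = "https://map11.epa.gov/arcgis/rest/services"  # example from the issue

def walk_services(url=ROOT):
    """Yield the URL of every service reachable from the directory root."""
    listing = requests.get(url, params={"f": "json"}).json()
    for svc in listing.get("services", []):
        # Service names inside folders already carry the folder prefix
        # (e.g. "Folder/Service"), so URLs are built from the root.
        yield f"{ROOT}/{svc['name']}/{svc['type']}"
    for folder in listing.get("folders", []):
        yield from walk_services(f"{ROOT}/{folder}")

for service_url in walk_services():
    print(service_url)  # each of these is a crawlable JSON/HTML page
```

Pulling the actual layer data out of each service is the harder part, which is what esri-dump handles.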

So I don't have a good answer for this, but I think it raises a bigger question. My general sense of Data Together’s goal is to provide an interface for community-held data in a way that makes it easy to access/publish data on distributed infrastructure (i.e. IPFS). The website archiving project is one use case of DT, motivated by the desire to back up websites and data on independent machines, essentially providing a community cache of websites and the associated data. For clarity, I'll refer to the two types of caches separately (URLs/websites vs. data).

The relevant topics brought up in this issue are:

  1. There’s a huge diversity of ways that "uncrawlable" data are presented. How should we handle/standardize the responses?
  2. How do we associate scraped/cached data with URLs, especially for APIs which use URL content as part of their input? And if multiple people provide data for the same URL, how do we determine which one is the "correct" associated data?

My proposed solution kind of sidesteps the issues by moving the responsibility to the community, which I think is more sustainable and also thematically appropriate:

  1. Community members must provide their own schema and metadata for the data caches they submit. We can provide guidelines, examples, and best practices, but we will not be prescriptive
  2. Community members, including the original uploader, can propose links between data caches and URL caches
  3. Most importantly, community members vote on the quality of data caches AND the appropriateness of links between caches

You can think of this as a bipartite graph between URLs and data caches, where the community gets to choose the weight of the links. I like this because it's more community-oriented; it's impossible (and undesirable) to anticipate all of the ways that data could look, and this flexibility allows the community to prioritize and present data in the ways they think are most appropriate. The drawback is that we have fewer guarantees, since things aren't enforced at the code level but at the community level. I'm interested in hearing people's thoughts!
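
To sketch the shape of that graph (all names here are hypothetical, not an existing DT API), something like:

```python
# Toy sketch of the bipartite graph: URL caches on one side, data caches
# on the other, with community votes as edge weights. Hypothetical names.

class CacheGraph:
    def __init__(self):
        self.link_votes = {}     # (url_id, data_id) -> net votes on the link
        self.quality_votes = {}  # data_id -> net votes on cache quality

    def propose_link(self, url_id, data_id):
        # Anyone, including the original uploader, can propose a link;
        # it starts with neutral weight.
        self.link_votes.setdefault((url_id, data_id), 0)

    def vote_link(self, url_id, data_id, delta):
        # delta is e.g. +1 or -1: the community weights the edge, not the code.
        self.link_votes[(url_id, data_id)] = \
            self.link_votes.get((url_id, data_id), 0) + delta

    def vote_quality(self, data_id, delta):
        self.quality_votes[data_id] = self.quality_votes.get(data_id, 0) + delta

    def data_for_url(self, url_id):
        """Data caches linked to this URL, best-supported links first."""
        edges = [(d, w) for (u, d), w in self.link_votes.items() if u == url_id]
        return sorted(edges, key=lambda e: e[1], reverse=True)
```

Nothing here is enforced at the code level beyond bookkeeping; what a vote means is exactly the community-level question above.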

b5 commented

Thanks @jeffreyliu for taking the time to work through this topic. I'm pickin' up what you're throwin' down. If possible, it'd be great to figure out a way to integrate our findings here with the wonderful groundwork put forth by @mhucka on the topic.

I agree with everything you've said, so I'm going to rephrase it & see if we're on the same page. As you've pointed out, when we say "uncrawlable", we effectively mean: some resource presented on the web that we don't have direct access to.

Analogy time! I like to think of every stated uncrawlable resource as "a call for essays on a topic". For now, assume this "topic" is a single URL that doesn't crawl nicely (that will probably broaden in the future). A volunteer then submits an "essay" on the topic, which is the result of running their script. And like any good essay, it's important to have clear citation. There's one minimum citation, the script that produced the result, plus zero or more "cited" URLs that the essay claims are connected to the output.

Further, essays must all be structured in the same way. In our case this means all scripts must produce structured data (ideally, a single results table); this last caveat means that submissions can be compared. The only major difference in this analogy is that there's no teacher-student relationship: the community is both.
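
As a strawman, the minimum citation for an essay might be recorded something like this (field names are illustrative, not a settled schema):

```python
# Strawman manifest for an "essay" submission: the script that produced
# the result, the structured output, and the URLs the essay cites.
# All field names are illustrative, not a spec.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Submission:
    script_hash: str   # content hash (e.g. IPFS) of the script that ran
    result_hash: str   # hash of the structured output, ideally one table
    cited_urls: List[str] = field(default_factory=list)  # zero or more
    schema: dict = field(default_factory=dict)  # uploader-provided schema
    notes: str = ""    # free-form methodology / metadata
```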

Hopefully this draws us back to the same conclusions you've stated @jeffreyliu: we should focus on scaffolding the process of essay submission, leaving topic dictation & paper-grading to the community.

Anyway, that's a lot of writing to say I agree completely with @jeffreyliu's take; hopefully this diatribe will help us when it comes to implementation details.

@b5 great metaphor! Yeah, I'm thinking of it as a conversation: the community identifies "hey, we have this problem" and different people can submit different solutions, which may address the problem in different ways.

One example I'm thinking of is an interactive map with layers: some people in the community might want the raw data so they can do computational analysis on it, while others may just want the layers as images. So there can be multiple solutions to the same problem that address different needs. All we enforce is that a solution shows its work (source code) and adheres to some very broad guidelines about data integrity; it's up to the community to determine the connections between that solution and other resources on DT.

we should focus on scaffolding the process of essay submission, leaving topic dictation & paper-grading to the community.

Yes, exactly! We can still provide examples of "good" submissions as guidelines, but they aren't hard and fast, because what ultimately matters is whether a submission is useful for the community.

On standup tonight:

  • Metaphor of "Issues and PRs" for people to describe data they use and are interested in (and find out if other people also use and are talking about it)
  • Think about a first iteration of technical requirements for:

a conversation in that the community identifies "hey we have this problem" and different people can submit different solutions, which may address the problem in different ways.

Gonna move issues to Roadmap & DataTogether repos to tease out these parts

@jeffreyliu's comment upstream touches on a lot of topics that probably should be discussed individually. Do we have a better way of doing that than writing more comments on this issue? The linear stream of comments in GitHub issues makes for difficult reading when several topics need to be covered. I'm not sure making separate issues would be ideal either, because then things become disconnected, but I guess we could combine that with a page somewhere that gathers the topics together and links to the individual issues (maybe as a wiki page? or something else?).

Thinking about my question some more, another option might be to create a project for this discussion, file individual questions as individual issues, and put the issues into project cards. Here's a quick mockup to convey the idea more clearly:
[mockup image: design-project]

This would need to be followed up by summaries of the conclusions somewhere else; perhaps the GitHub wiki would work well for that. Basically, once decisions on conceptual & design matters are made, it's hard to later wrap your mind around all the individual points raised and settled. There needs to be a document that pulls it all together, and one way to do that is to keep a summary in the wiki.

(It would probably be necessary to designate someone as the point person who summarizes conclusions and writes a coherent document. Although I'm already overcommitted on non-DT matters, I will volunteer for this if no one else does; it's a good opportunity to force myself to stay in the loop and understand things more deeply.)