datatogether/archivertools

Archiver UUID and url - are they still relevant?


Currently the Archiver constructor takes two arguments: a UUID and a url. I'm wondering whether they're still relevant.
The UUID is a holdover from the archivers.space workflow - I'm not sure there's an analog in DT. I think we can just remove it.
Also, in #5, we discussed that a custom crawl could span multiple urls. This raises a couple of questions:

  • Does it still make sense for a scraper to be tied to a 'root' url?
  • How do we distinguish between "urls that this scraper takes data from" and "child urls of pages linked from this page"?
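
For reference, a minimal sketch of the current constructor as described above; the parameter names are assumptions based on this issue, not taken from the implementation:

```python
class Archiver:
    """Sketch of the current interface (parameter names assumed)."""

    def __init__(self, uuid, url):
        self.uuid = uuid  # holdover from the archivers.space workflow
        self.url = url    # the single 'root' url the scraper is tied to
```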
ebenp commented

I'm not sure the UUID is as relevant here as it was in archivers.space, since we are using the archiver tool on either child urls or byte files, rather than on individual url pages. Maybe the UUID should be replaced by the scraper's root url, or removed altogether.

Regarding custom crawls spanning multiple urls, it seems that a scraper has to begin at some url. Maybe that's our root url, serving as a reference for future scraper runs, and it gets set at archiver tool initialization?
To distinguish the type of url collected, maybe data-collection urls should be passed in the add_data function, and child urls maintained through the add_url function?
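
A minimal sketch of the interface this suggests, using the add_data / add_url names from the comment above; illustrative only, not the current implementation:

```python
class Archiver:
    def __init__(self, root_url):
        # root url is set once at initialization and serves as the
        # reference point for future runs of the same scraper
        self.root_url = root_url
        self.data = []        # content collected from data-collection urls
        self.child_urls = []  # child urls found on crawled pages

    def add_data(self, url, content):
        """Record content scraped from a data-collection url."""
        self.data.append((url, content))

    def add_url(self, url):
        """Record a child url discovered during the crawl."""
        self.child_urls.append(url)
```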

b5 commented

So, you know, it's only been four months, but yes, UUIDs should be ignored whenever possible. I'd favor hashes for blob content and urls for anything that has a clear association to... a URL.
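
As a sketch of what hash-based identification for blob content could look like (sha256 is an assumed choice here; the thread doesn't specify an algorithm):

```python
import hashlib

def blob_id(content: bytes) -> str:
    # identical bytes always hash to the same id, so re-archiving
    # unchanged content is naturally deduplicated; no UUID needed
    return hashlib.sha256(content).hexdigest()

print(blob_id(b"some archived bytes"))  # 64-character hex digest
```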