datatogether/archivertools

Morph/Data Together API interaction

Closed this issue · 6 comments

I've received confirmation that we're allowed to make POST calls from within scrapers, which opens up the possibility of uploading files and results as the scraper runs.

On the DT server side, we'll need to set up API keys for users to authenticate their uploads, as well as determine the spec for what the POSTs look like. Things I anticipate the upload spec will need to include (one possible payload shape is sketched after the list):

  • API key
  • Identity of the scraper
  • UUID of the associated url
  • metadata associated with the run, or some reference to it
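
For concreteness, here's one shape such a payload could take. This is a sketch only; every field name below is an illustrative placeholder, not a settled spec:

```python
# Hypothetical upload payload; all field names are illustrative, not a settled spec.
payload = {
    "api_key": "dt_live_...",              # authenticates the uploading user
    "scraper": "datatogether/my-scraper",  # identity of the scraper
    "url_uuid": "3f1c9e2a-...",            # UUID of the associated url
    "run_metadata": {                      # metadata for this run, or a reference to it
        "started_at": "2017-08-01T12:00:00Z",
        "content_type": "text/html",
    },
}
```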

On the archivertools side, we'll need to provide an upload function that does the appropriate formatting and performs the upload. Things I anticipate will need to change to accommodate this (sketched after the list):

  • add the DT API key to the constructor of the Archiver class,
  • add an Archiver.upload() or Archiver.commit() method
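
A minimal sketch of what those two changes could look like, assuming the requests library for HTTP and a hypothetical endpoint URL; the real Archiver class carries more state than shown here:

```python
import requests  # assumed HTTP client, not a settled dependency


class Archiver:
    """Sketch only; the actual Archiver constructor signature may differ."""

    def __init__(self, url, url_uuid, api_key):
        self.url = url
        self.url_uuid = url_uuid
        self.api_key = api_key  # new: the user's DT API key
        self.files = []         # accumulated as the scraper runs

    def commit(self, endpoint="https://datatogether.example/api/uploads"):
        """Format the accumulated results and POST them to the DT server."""
        payload = {
            "api_key": self.api_key,
            "url_uuid": self.url_uuid,
            "files": self.files,
        }
        resp = requests.post(endpoint, json=payload)
        resp.raise_for_status()  # surface auth/spec errors to the scraper
        return resp.json()
```
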
ebenp commented

Great news! Thanks for looking into this. I think the upload spec sounds good. Do we need some metadata about what's being uploaded? Content type? Date accessed? I guess that could live in the run metadata.

I'm leaning slightly towards Archiver.commit, as I've been thinking of these as repositories with checkpoints; "upload" suggests an expanding container to me. However, I'm open to whatever makes the most sense.

b5 commented

Lovely!

Flagging relevant areas of interest from datatogether/identity: we currently have support for API keys.

It's also worth noting that we support JSON Web Tokens (JWTs). We could issue these as time-bounded, single-use tokens and record them to track who initiated each scrape; the upside is that spent tokens could be published without security concerns.
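
As a rough illustration of the time-bounded, single-use idea (this uses the PyJWT library; the claim names and the jti bookkeeping are my assumptions, not the datatogether/identity implementation):

```python
import datetime
import uuid

import jwt  # PyJWT

SECRET = "server-side-signing-secret"  # held only by the DT server

def issue_scrape_token(user_id, url_uuid, ttl_minutes=15):
    """Issue a short-lived, single-use token recording who initiated the scrape."""
    now = datetime.datetime.utcnow()
    claims = {
        "sub": user_id,            # who initiated the scrape
        "url_uuid": url_uuid,      # what they may upload against
        "jti": str(uuid.uuid4()),  # unique id so the server can mark it as used
        "iat": now,
        "exp": now + datetime.timedelta(minutes=ttl_minutes),  # time-bounded
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_scrape_token(token):
    """Rejects expired tokens; a real server would also reject reused jti values."""
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```

Because the signature can only be produced with the server-side secret and the token expires quickly, a spent token could be written into run metadata and published without security concerns.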

This token approach proved valuable with the burner credentials in the previous archivers.space s3-upload-server. While people were a little confused by their use, they did facilitate clean provisioning of privileged access, which is a pattern I think could apply nicely here.

I think we should pick between API keys and JWTs, and from there I'll create the necessary issues to surface one of those two pieces of functionality publicly. Both will have to happen in the long run; it's just a question of which to prioritize.

b5 commented

As for the spec for a POST, I think what you've outlined is a great starting point, and I agree with @ebenp that the next step is to flesh out the run metadata; from there it'll be a breeze to create the endpoint.

I like using JWTs for providing transaction-level access control, though they might be a bit more confusing to use from a scraper-writer's perspective, since you would need a new one per upload.

Ideally we'd have both: API keys for users, and then at runtime we automatically issue their scraper a JWT so that we can track transactions. The API key would just authenticate that the user is allowed to receive a JWT, and the JWT would handle the privileges for actually uploading?
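
A sketch of how that handshake might look from the archivertools side; both endpoints and the response shapes here are hypothetical:

```python
import requests  # assumed HTTP client

BASE = "https://datatogether.example/api"  # hypothetical base URL

def upload_with_api_key(api_key, url_uuid, files):
    """Exchange a long-lived API key for a short-lived JWT, then upload with it.

    The API key only proves the user may receive a JWT; the JWT carries the
    actual upload privilege and identifies the transaction.
    """
    # Step 1: API key -> single-use JWT, invisible to the scraper author.
    token = requests.post(f"{BASE}/token", json={"api_key": api_key}).json()["jwt"]

    # Step 2: upload with the JWT; the server logs its claims to record
    # who initiated this transaction.
    resp = requests.post(
        f"{BASE}/uploads",
        headers={"Authorization": f"Bearer {token}"},
        json={"url_uuid": url_uuid, "files": files},
    )
    resp.raise_for_status()
    return resp.json()
```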

I'm fine with whichever method you want to implement first, whichever is easiest.

Thinking about this a little more, I think we should implement API keys first, because that would be the least disruptive on the user's end. The end goal is still the combination of API keys plus automatic generation of a JWT for transaction-level authentication. But if we implement JWTs first, users would have to change how they upload data once API keys arrive; if we do API keys first, the JWT issuing and authentication would be automatic and invisible to the user anyway, so there'd be no disruption to the usage pattern. So if it's not too difficult, I vote for API-key support first.

Closed via #8