harvard-lil/scoop

Make principal web archive capture optional?

matteocargnelutti opened this issue · 5 comments

Should it be possible to skip the web capture step?

Potential use case: capturing only the provenance summary, screenshot, PDF snapshot, and video extraction for a given web page.

edsu commented

Is the idea that it would cut down on the amount of storage?

I can't address your question, but wanted to say: Nice to see you here, @edsu!

Hi @edsu!

> Is the idea that it would cut down on the amount of storage?

It is more about supporting use cases that do not revolve around capturing HTTP exchanges in a WARC file.
For example, some users might just want to make a PDF capture or screenshot of a web page using Scoop, and only care about that artifact.

edsu commented

But don't you need to do the HTTP exchanges to generate the screenshot?

@edsu Yes and no.

  • Yes: the HTTP exchanges still pass through the proxy as Scoop navigates to the page to take the screenshot.
  • No: if I am only interested in the screenshot, I don't need to record these HTTP exchanges, and I can also skip some intermediate steps, such as some of the browser behaviors.
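
To illustrate the "yes and no" above, here is a minimal TypeScript sketch. It is not Scoop's actual API: `Exchange`, `CaptureOptions`, `webCapture`, `PassThroughProxy`, and `planSteps` are all hypothetical names made up for this example, and which steps could really be skipped is an assumption. It only shows the idea that traffic can flow through the proxy without being recorded.

```typescript
// Hypothetical sketch: none of these names come from Scoop's real API.

/** One HTTP request/response pair seen by the intercepting proxy. */
interface Exchange {
  url: string;
  status: number;
  body: string;
}

/** Hypothetical options; `webCapture` is the switch discussed in this issue. */
interface CaptureOptions {
  webCapture: boolean; // record HTTP exchanges for the web archive?
  screenshot: boolean;
  pdfSnapshot: boolean;
}

/** Forwards every exchange to the browser, but stores it only when recording is on. */
class PassThroughProxy {
  readonly recorded: Exchange[] = [];
  forwarded = 0;
  private readonly record: boolean;

  constructor(record: boolean) {
    this.record = record;
  }

  handle(exchange: Exchange): Exchange {
    this.forwarded += 1; // the browser always receives its response
    if (this.record) {
      this.recorded.push(exchange); // kept only when a web archive is wanted
    }
    return exchange;
  }
}

/** Lists the steps a capture would run for the given options. */
function planSteps(options: CaptureOptions): string[] {
  const steps = ["navigate"]; // always needed: the page has to load
  if (options.webCapture) {
    // Assumption: these steps only matter when exchanges are recorded.
    steps.push("run-browser-behaviors", "write-web-archive");
  }
  if (options.screenshot) steps.push("screenshot");
  if (options.pdfSnapshot) steps.push("pdf-snapshot");
  return steps;
}

// Screenshot-only capture: traffic is forwarded, nothing is recorded.
const options: CaptureOptions = { webCapture: false, screenshot: true, pdfSnapshot: false };
const proxy = new PassThroughProxy(options.webCapture);
proxy.handle({ url: "https://example.com/", status: 200, body: "<html></html>" });
const steps = planSteps(options);
```

With `webCapture` set to `false`, the exchange is still forwarded (the page loads and can be screenshotted), but the proxy's `recorded` list stays empty and the archive-related steps are left out of the plan.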