webrecorder/har2warc

problem after converting a .har -> .warc and importing in webrecorder

wsdookadr opened this issue · 1 comments

Hi,

Thank you for writing har2warc. Below I describe what I tried, along with some minor differences from what I expected that I don't know how to explain.

I pulled the splash docker image published by @scrapinghub and ran it like so:

docker pull scrapinghub/splash
docker run -it -p 8050:8050 --name render_html scrapinghub/splash

Then I rendered a page using splash and exported the resulting .har (as indicated in splash's docs):

curl 'http://localhost:8050/render.har?url=https://www.digitalocean.com/community/tutorials/how-to-secure-haproxy-with-let-s-encrypt-on-centos-7&timeout=10&wait=7&response_body=1' > 1.har
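When many pages are fetched this way, the request URL can be built programmatically so that the target page URL is percent-encoded and cannot clash with Splash's own query parameters. The following is a minimal sketch, assuming Splash is listening on localhost:8050 as in the docker command above; the function name `build_render_har_url` is hypothetical, and the parameters are the same ones used in the curl command.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable here, matching the docker run command above.
SPLASH_BASE = "http://localhost:8050"


def build_render_har_url(page_url, timeout=10, wait=7, base=SPLASH_BASE):
    """Return a render.har request URL with the page URL percent-encoded.

    Hypothetical helper for illustration; it only assembles the URL and
    does not contact Splash.
    """
    query = urlencode({
        "url": page_url,        # target page, encoded so its own '?' and '&' are safe
        "timeout": timeout,     # same values as in the curl command
        "wait": wait,
        "response_body": 1,     # ask Splash to include response bodies in the HAR
    })
    return f"{base}/render.har?{query}"
```

The resulting string can be passed to curl or any HTTP client in place of the hand-written URL.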

Then I converted the resulting .har to .warc:

har2warc 1.har 1.warc

After this I imported the 1.warc file into webrecorder.
When I viewed the page as stored in webrecorder, all styling seemed to be missing.

I understand that this does not involve only har2warc, and that the problem could originate in any of har2warc, splash, or webrecorder. I'm not sure where to attribute this behaviour.
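One way to narrow down which stage loses the styling is to check the .har file itself before conversion. The sketch below lists entries whose response carries no body text, using the standard HAR layout (`log.entries[].response.content.text`). The function names are hypothetical, and this is only a diagnostic aid under that assumption, not necessarily how the problem was eventually tracked down. Some responses, such as redirects, legitimately have no body, so the output needs to be read with that in mind.

```python
import json


def load_har(path):
    """Load a HAR file (JSON) from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def entries_without_body(har):
    """Return the request URLs of HAR entries that have no stored response body.

    Hypothetical helper: an entry counts as 'missing' when
    response.content.text is absent or empty.
    """
    missing = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        if not content.get("text"):
            missing.append(entry.get("request", {}).get("url", "<unknown>"))
    return missing
```

For example, `entries_without_body(load_har("1.har"))` returns the URLs to look at. If stylesheets or images appear in that list, the bodies were already absent before har2warc ran.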

The general use case would be automating a large archiving operation whose result is a faithful reproduction of the original website, including websites that contain a lot of JavaScript-rendered content, which nowadays is the case for many of them.

I'd be interested in your thoughts.

Thanks,
Stefan

Me again. I was able to isolate the problem to splash and opened PR #821. With that change, the pipeline splash -> har2warc -> webrecorder is now fully functional: all images and styling are showing up.
I'm going to close this issue.