palewire/savepagenow

savepagenow fails despite HTTP code 200 (success) in archival

baerbock opened this issue · 4 comments

$ savepagenow https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/savepagenow/api.py", line 50, in capture
    archive_id = response.headers['Content-Location']
  File "/usr/lib/python3.8/site-packages/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/savepagenow", line 11, in <module>
    load_entry_point('savepagenow==0.0.13', 'console_scripts', 'savepagenow')()
  File "/usr/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/savepagenow/api.py", line 127, in cli
    archive_url = capture(url, **kwargs)
  File "/usr/lib/python3.8/site-packages/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(dict(status_code=response.status_code, headers=response.headers))
savepagenow.api.WaybackRuntimeError: {'status_code': 200, 'headers': {'Server': 'nginx/1.15.8', 'Date': 'Fri, 13 Dec 2019 12:40:01 GMT', 'Content-Type': 'application/pdf', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Cache-Control': 'private', 'X-Archive-Orig-Content-Length': '108611', 'Content-Disposition': 'inline;filename=firmenportraet-zodiacdatasystems.pdf', 'X-Archive-Orig-X-XSS-Protection': '1; mode=block', 'X-Archive-Orig-X-Frame-Options': 'SAMEORIGIN', 'X-Archive-Orig-Referrer-Policy': 'strict-origin-when-cross-origin', 'X-Archive-Orig-X-Content-Type-Options': 'nosniff', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000', 'X-Archive-Orig-Date': 'Fri, 13 Dec 2019 12:39:59 GMT', 'X-Archive-Orig-Connection': 'close', 'Cache-Control': 'max-age=1800', 'X-Archive-Guessed-Content-Type': 'application/pdf', 'Memento-Datetime': 'Fri, 13 Dec 2019 12:39:59 GMT', 'Link': '<https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx>; rel="original", <https://web.archive.org/web/timemap/link/https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx>; rel="timegate", <https://web.archive.org/web/20191213123959/https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx>; rel="first memento"; datetime="Fri, 13 Dec 2019 12:39:59 GMT", <https://web.archive.org/web/20191213123959/https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx>; rel="memento"; datetime="Fri, 13 Dec 2019 12:39:59 GMT", <https://web.archive.org/web/20191213123959/https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfx>; rel="last memento"; datetime="Fri, 13 Dec 2019 12:39:59 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 
'X-Archive-Src': 'live-20191213123849-wwwb-app1.us.archive.org.warc.gz', 'Server-Timing': 'captures_list;dur=181.621244, exclusion.robots.policy;dur=0.173575, CDXLines.iter;dur=11.650993, PetaboxLoader3.datanode;dur=100.712822, RedisCDXSource;dur=6.796307, exclusion.robots;dur=0.186336, load_resource;dur=15.795024, LoadShardBlock;dur=160.454660, esindex;dur=0.013466, PetaboxLoader3.resolve;dur=38.568371', 'X-App-Server': 'wwwb-app102', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20191213123959/https://www.bergischgladbach.de/firmenportraet-zodiacdatasystems.pdfxDE', 'X-Page-Cache': 'MISS'}}

Here is my code:

#!/usr/bin/env python3
import savepagenow
url = "https://www.nvidia.com/attach/3"
archive_url, captured = savepagenow.capture_or_cache(url)

Same issue (ran using Python 3.8):

Traceback (most recent call last):
  File "/home/fox/.local/lib/python3.8/site-packages/savepagenow/api.py", line 50, in capture
    archive_id = response.headers['Content-Location']
  File "/usr/lib/python3.8/site-packages/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "rep.py", line 4, in <module>
    archive_url, captured = savepagenow.capture_or_cache(url)
  File "/home/fox/.local/lib/python3.8/site-packages/savepagenow/api.py", line 84, in capture_or_cache
    return capture(target_url, user_agent=user_agent, accept_cache=False), True
  File "/home/fox/.local/lib/python3.8/site-packages/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(dict(status_code=response.status_code, headers=response.headers))
savepagenow.api.WaybackRuntimeError: {'status_code': 200, 'headers': {'Server': 'nginx/1.15.8', 'Date': 'Fri, 13 Dec 2019 23:05:14 GMT', 'Content-Type': 'video/mpeg', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Accept-Ranges': 'bytes', 'content-disposition': 'attachment; filename=4x4_dv1.mpeg', 'X-Archive-Orig-Last-Modified': 'Fri, 13 Dec 2019 22:58:21 GMT', 'X-Archive-Orig-Server': 'ECD (daa/7D04)', 'X-Archive-Orig-X-Powered-By': 'ASP.NET', 'X-Archive-Orig-X-UA-Compatible': 'IE=10', 'X-Archive-Orig-Content-Length': '14424685', 'X-Archive-Orig-Date': 'Fri, 13 Dec 2019 23:05:11 GMT', 'X-Archive-Orig-Connection': 'close', 'Cache-Control': 'max-age=1800', 'X-Archive-Guessed-Content-Type': 'application/octet-stream', 'Memento-Datetime': 'Fri, 13 Dec 2019 23:05:10 GMT', 'Link': '<https://www.nvidia.com/attach/3>; rel="original", <https://web.archive.org/web/timemap/link/https://www.nvidia.com/attach/3>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.nvidia.com/attach/3>; rel="timegate", <https://web.archive.org/web/20030814182320/http://www.nvidia.com:80/attach/3>; rel="first memento"; datetime="Thu, 14 Aug 2003 18:23:20 GMT", <https://web.archive.org/web/20191213230039/https://www.nvidia.com/attach/3>; rel="prev memento"; datetime="Fri, 13 Dec 2019 23:00:39 GMT", <https://web.archive.org/web/20191213230510/https://www.nvidia.com/attach/3>; rel="memento"; datetime="Fri, 13 Dec 2019 23:05:10 GMT", <https://web.archive.org/web/20191213230510/https://www.nvidia.com/attach/3>; rel="last memento"; datetime="Fri, 13 Dec 2019 23:05:10 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'live-20191213224952-wwwb-app1.us.archive.org.warc.gz', 'Server-Timing': 'PetaboxLoader3.datanode;dur=112.981714, esindex;dur=0.010412, exclusion.robots.policy;dur=0.148097, 
RedisCDXSource;dur=334.178204, captures_list;dur=605.641147, CDXLines.iter;dur=10.733472, load_resource;dur=434.584714, PetaboxLoader3.resolve;dur=41.078852, exclusion.robots;dur=0.156684, LoadShardBlock;dur=258.365624', 'X-App-Server': 'wwwb-app58', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20191213230510/https://www.nvidia.com/attach/3DE', 'X-Page-Cache': 'MISS'}}

On a possibly related note: I have never used this wrapper before, but it also feels slow (the error only appears after 17 seconds). I'm not sure whether that is caused by this error or is a general problem. If it really is this slow, that would deserve its own issue; perhaps the API could be adapted to accept multiple URLs at once (I intend to submit hundreds of URLs).
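
Until something like batch submission exists, a plain loop that records failures instead of stopping at the first one may be enough for a few hundred URLs. This is only a sketch: `capture_many` and its `delay_seconds` parameter are hypothetical names, not part of savepagenow, and `capture` is assumed to be any callable that takes a URL, such as `savepagenow.capture_or_cache` from the code above.

```python
import time


def capture_many(urls, capture, delay_seconds=0.0):
    """Call capture(url) for every URL and keep going when one fails.

    Returns two dicts keyed by URL: successful results and raised exceptions.
    """
    results = {}
    failures = {}
    for url in urls:
        try:
            results[url] = capture(url)
        except Exception as exc:  # e.g. the WaybackRuntimeError shown above
            failures[url] = exc
        if delay_seconds:
            # Optional pause between requests (hypothetical knob).
            time.sleep(delay_seconds)
    return results, failures
```

Calling `capture_many(urls, savepagenow.capture_or_cache)` would then return every `(archive_url, captured)` pair that succeeded, plus a dict of the URLs that raised, so a single bad URL does not abort the whole run. It does not make individual captures any faster.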

@baerbock, try https://github.com/eggplants/wbsv-cli.
As for your first URL: the archive.org API rejects files like .pdfx. As for the second one: I can't reach that site; it looks like a major timeout or something similar.

There may also be an edge case where the 'content-location' header is not provided and my module isn't smart enough to know what to do with it.
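
One possible way to handle that edge case, judging only from the headers in the two tracebacks above: both responses lack `Content-Location` but carry a `Link` header whose `rel="memento"` entry matches `Memento-Datetime`. The sketch below falls back to that entry. It is not the module's actual code; `archive_url_from_headers`, `memento_url_from_link` and `WAYBACK_ROOT` are hypothetical names, and it assumes that `Content-Location`, when present, is a path relative to `https://web.archive.org`.

```python
import re

# Assumption: Content-Location, when present, is a path under this host.
WAYBACK_ROOT = "https://web.archive.org"

# One Link header entry: <url> followed by zero or more ;name=value params.
_LINK_ENTRY = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)'
)
_REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^;,\s]+))', re.IGNORECASE)


def memento_url_from_link(link_header):
    """Return the URL of the entry whose rel is exactly "memento", else None."""
    for url, params in _LINK_ENTRY.findall(link_header):
        match = _REL_PARAM.search(params)
        if match is None:
            continue
        rel = match.group(1) if match.group(1) is not None else match.group(2)
        if rel.strip().lower() == "memento":
            return url
    return None


def archive_url_from_headers(headers):
    """Prefer Content-Location; fall back to the Link header's memento entry."""
    lowered = {key.lower(): value for key, value in headers.items()}
    location = lowered.get("content-location")
    if location:
        if location.startswith("/"):
            return WAYBACK_ROOT + location
        return location
    link = lowered.get("link")
    return memento_url_from_link(link) if link else None
```

Passing `response.headers` from a `requests` response should work, since only `.items()` is used. The function returns `None` when neither header yields a URL, so the caller can still raise in that case.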

This was tackled in #25 and, I hope, fixed with the version shipped today https://pypi.org/project/savepagenow/1.0.0/