palewire/savepagenow

'content-location' error

Cauchon opened this issue · 12 comments

I'm getting an error quite often when using the CLI. You can see the live logs here: https://jlc.ninja/canarybot.txt

Traceback (most recent call last):
  File "/usr/local/bin/savepagenow", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/savepagenow/api.py", line 117, in cli
    archive_url = capture(url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/savepagenow/api.py", line 43, in capture
    archive_id = response.headers['Content-Location']
  File "/usr/local/lib/python2.7/dist-packages/requests-2.18.4-py2.7.egg/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

Sorry for the hassle. What version are you using?

I set it up a few days ago - so I believe I'm using the latest pip version, 0.0.10.

Here is the list of CLI commands being used, one per hour:```

savepagenow https://www.slickvpn.com/warrant-canary/
savepagenow https://nordvpn.com/about-us/
savepagenow https://protonvpn.com/blog/transparency-report/
savepagenow https://www.perfect-privacy.com/warrant-canary/
savepagenow https://www.bolehvpn.net/canary.txt
savepagenow https://www.ipredator.se/static/downloads/canary.txt
savepagenow https://www.doublehop.me/warrant_canary.txt
savepagenow https://www.ivpn.net/resources/canary.txt
savepagenow https://proxy.sh/canary
savepagenow https://tutanota.com/blog/posts/transparency-report
savepagenow https://api.azirevpn.com/v1/warrantcanary
savepagenow https://my.liquidvpn.com/canary/canary
savepagenow https://www.acevpn.com/transparency/

I should also note that it appears that the pages are getting archived by the Wayback Machine.

I suspect that you are getting a Wayback Machine response that my code is too clumsy to parse correctly. I am pushing a new version of the package to PyPI now with a try/except around the parsing of the Content-Location header, which I hope will better surface whatever is happening to your requests.
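For readers following along, here is a minimal sketch of what such a try/except might look like. The exception name `WaybackRuntimeError` comes from the later traceback in this thread. The helper name `extract_archive_id` and the sample header value are hypothetical and are not taken from the package source.

```python
class WaybackRuntimeError(Exception):
    """Raised when the Wayback Machine response cannot be parsed."""


def extract_archive_id(headers):
    """Return the Content-Location header, or raise with all headers attached.

    `headers` is any mapping. With requests, response.headers is
    case-insensitive; a plain dict, as used below, is not.
    (Hypothetical helper, for illustration only.)
    """
    try:
        return headers["Content-Location"]
    except KeyError:
        # Surface the full header set so the failure can be diagnosed.
        raise WaybackRuntimeError(dict(headers))


if __name__ == "__main__":
    # Illustrative header value, not a recorded API response.
    ok = {"Content-Location": "/web/20181125214711/https://example.com/"}
    print(extract_archive_id(ok))
    try:
        extract_archive_id({"Content-Type": "text/plain"})
    except WaybackRuntimeError as err:
        print("headers without Content-Location:", err)
```

Attaching the headers to the exception does not fix the missing value, but it makes the failing response visible in the logs.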

I'll let you know when it's available. Please upgrade at that time and let me know if the problem continues.

Okay. Version 0.0.11 is shipped to PyPI. Give that a try. It wouldn't surprise me if you still get an error, but I hope it will be a more informative one.

Thanks for the fast response on this. I went ahead and upgraded, will keep an eye on the logs and follow up in a few hours/tomorrow.

I just ran the script myself and got these more informative error headers in a case where content-location was not found:

{
	'X-Archive-Orig-x-frame-options': 'DENY',
	'X-Archive-Guessed-Encoding': 'utf-8',
	'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org",
	'Transfer-Encoding': 'chunked',
	'X-Archive-Orig-accept-ranges': 'bytes',
	'X-Archive-Orig-last-modified': 'Mon, 15 Oct 2018 18:16:01 GMT',
	'X-ts': '----',
	'X-Archive-Guessed-Content-Type': 'text/plain',
	'X-Archive-Orig-server': 'Apache/1.3.39',
	'X-Archive-Orig-public-key-pins': 'pin-sha256="2K/nT2DfdpiAzbeXovA8uFGbK6W3abSqxkyBQzSgUZg="; pin-sha256="aYMipRdtHfa6LJOiTrZ+JqIB1SfbN+8bI+C0XKfOswE="; max-age=2592000;',
	'X-Archive-Orig-strict-transport-security': 'max-age=31536000;',
	'X-Archive-Orig-referrer-policy': 'strict-origin',
	'Memento-Datetime': 'Sun, 25 Nov 2018 21:47:11 GMT',
	'Date': 'Sun, 25 Nov 2018 21:47:23 GMT',
	'Link': '<https://www.ipredator.se/static/downloads/canary.txt>; rel="original", <http://web.archive.org/web/timemap/link/https://www.ipredator.se/static/downloads/canary.txt>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/https://www.ipredator.se/static/downloads/canary.txt>; rel="timegate", <http://web.archive.org/web/20150310051911/https://www.ipredator.se/static/downloads/canary.txt>; rel="first memento"; datetime="Tue, 10 Mar 2015 05:19:11 GMT", <http://web.archive.org/web/20181124180003/https://www.ipredator.se/static/downloads/canary.txt>; rel="prev memento"; datetime="Sat, 24 Nov 2018 18:00:03 GMT", <http://web.archive.org/web/20181125214711/https://www.ipredator.se/static/downloads/canary.txt>; rel="memento"; datetime="Sun, 25 Nov 2018 21:47:11 GMT", <http://web.archive.org/web/20181125214711/https://www.ipredator.se/static/downloads/canary.txt>; rel="last memento"; datetime="Sun, 25 Nov 2018 21:47:11 GMT"',
	'X-Archive-Orig-cache-control': 'no-store',
	'X-Page-Cache': 'MISS',
	'X-Cache-Key': 'httpweb.archive.org/web/20181125214711/https://www.ipredator.se/static/downloads/canary.txtUS',
	'X-location': 'All',
	'Server': 'nginx/1.15.5',
	'Connection': 'keep-alive',
	'X-Archive-Orig-connection': 'close',
	'X-Archive-Orig-age': '0',
	'X-Archive-Orig-content-length': '1301',
	'X-App-Server': 'wwwb-app102',
	'X-Archive-Orig-x-content-type-options': 'nosniff',
	'X-Archive-Src': 'live-20181125213758-wwwb-app4.us.archive.org.warc.gz',
	'X-Archive-Orig-x-xss-protection': '1; mode=block',
	'X-Archive-Orig-x-backend': 'hsiprod',
	'Cache-Control': 'max-age=1800',
	'X-Archive-Orig-date': 'Sun, 25 Nov 2018 21:47:12 GMT',
	'X-Archive-Orig-etag': '"515-578486df0d447-gzip"',
	'Content-Type': 'text/plain'
}

I wonder if you'll get the same thing back.

It also looks like these are 200 response codes from Wayback. So it doesn't seem like there's an error. So why no URL? Hmm. Any ideas? Unless you see something I don't, we might need to ask the maintainers of the API.

Not seeing anything on my end... I should mention that this is running from a cron script but figured that shouldn't matter.

After the update, I saw an initial success and now intermittent errors like the one below (live logs at https://jlc.ninja/canarybot.txt):

{
   "X-Archive-Orig-x-frame-options": "SAMEORIGIN",
   "X-Archive-Guessed-Encoding": "utf-8",
   "Content-Security-Policy": "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org",
   "Transfer-Encoding": "chunked",
   "X-Archive-Orig-vary": "Accept-Encoding",
   "X-Archive-Orig-last-modified": "Tue, 20 Nov 2018 02:18:34 GMT",
   "X-Archive-Orig-cf-ray": "47f8129ffd9f9656-SJC",
   "X-ts": "----",
   "X-Archive-Orig-server": "cloudflare",
   "Link": "<https://my.liquidvpn.com/canary/canary>; rel=\"original\", <http://web.archive.org/web/timemap/link/https://my.liquidvpn.com/canary/canary>; rel=\"timemap\"; type=\"application/link-format\", <http://web.archive.org/web/https://my.liquidvpn.com/canary/canary>; rel=\"timegate\", <http://web.archive.org/web/20150630191644/https://my.liquidvpn.com/canary/canary>; rel=\"first memento\"; datetime=\"Tue, 30 Jun 2015 19:16:44 GMT\", <http://web.archive.org/web/20181124000005/https://my.liquidvpn.com/canary/canary>; rel=\"prev memento\"; datetime=\"Sat, 24 Nov 2018 00:00:05 GMT\", <http://web.archive.org/web/20181126000005/https://my.liquidvpn.com/canary/canary>; rel=\"memento\"; datetime=\"Mon, 26 Nov 2018 00:00:05 GMT\", <http://web.archive.org/web/20181126000005/https://my.liquidvpn.com/canary/canary>; rel=\"last memento\"; datetime=\"Mon, 26 Nov 2018 00:00:05 GMT\"",
   "Memento-Datetime": "Mon, 26 Nov 2018 00:00:05 GMT",
   "Date": "Mon, 26 Nov 2018 00:00:05 GMT",
   "X-Archive-Orig-x-turbo-charged-by": "LiteSpeed",
   "X-Page-Cache": "MISS",
   "X-Cache-Key": "httpweb.archive.org/web/20181126000005/https://my.liquidvpn.com/canary/canaryUS",
   "X-location": "All",
   "Server": "nginx/1.15.5",
   "Connection": "keep-alive",
   "X-Archive-Orig-expect-ct": "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"",
   "X-Archive-Orig-content-length": "3189",
   "X-Archive-Orig-connection": "close",
   "X-Archive-Src": "live-20181125235907-wwwb-app15.us.archive.org.warc.gz",
   "X-Archive-Orig-set-cookie": "__cfduid=d495fc50a5f727ee7640668d46955394f1543190405; expires=Tue, 26-Nov-19 00:00:05 GMT; path=/; domain=.liquidvpn.com; HttpOnly; Secure",
   "X-Archive-Guessed-Content-Type": "text/plain",
   "Cache-Control": "max-age=1800",
   "X-Archive-Orig-date": "Mon, 26 Nov 2018 00:00:05 GMT",
   "X-App-Server": "wwwb-app103",
   "Content-Type": "text/plain"
}

Full log below:

Traceback (most recent call last):
  File "/usr/local/bin/savepagenow", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/savepagenow/api.py", line 127, in cli
    archive_url = capture(url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(response.headers)
savepagenow.api.WaybackRuntimeError: {'X-Archive-Guessed-Encoding': 'utf-8', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'Transfer-Encoding': 'chunked', 'X-Archive-Orig-access-control-allow-headers': 'X-Requested-With', 'X-Archive-Orig-vary': 'Accept-Encoding', 'X-Archive-Orig-last-modified': 'Thu, 01 Nov 2018 14:41:33 GMT', 'X-ts': '----', 'X-Archive-Orig-x-ratelimit-limit': '60', 'X-Archive-Orig-server': 'nginx', 'Link': '<https://api.azirevpn.com/v1/warrantcanary>; rel="original", <http://web.archive.org/web/timemap/link/https://api.azirevpn.com/v1/warrantcanary>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/https://api.azirevpn.com/v1/warrantcanary>; rel="timegate", <http://web.archive.org/web/20171230172644/https://api.azirevpn.com/v1/warrantcanary>; rel="first memento"; datetime="Sat, 30 Dec 2017 17:26:44 GMT", <http://web.archive.org/web/20181123230003/https://api.azirevpn.com/v1/warrantcanary>; rel="prev memento"; datetime="Fri, 23 Nov 2018 23:00:03 GMT", <http://web.archive.org/web/20181125230004/https://api.azirevpn.com/v1/warrantcanary>; rel="memento"; datetime="Sun, 25 Nov 2018 23:00:04 GMT", <http://web.archive.org/web/20181125230004/https://api.azirevpn.com/v1/warrantcanary>; rel="last memento"; datetime="Sun, 25 Nov 2018 23:00:04 GMT"', 'Memento-Datetime': 'Sun, 25 Nov 2018 23:00:04 GMT', 'Date': 'Sun, 25 Nov 2018 23:00:05 GMT', 'X-Archive-Orig-x-ratelimit-remaining': '59', 'X-Archive-Orig-cache-control': 'public', 'X-Page-Cache': 'MISS', 'X-Cache-Key': 'httpweb.archive.org/web/20181125230004/https://api.azirevpn.com/v1/warrantcanaryUS', 'X-location': 'All', 'Server': 'nginx/1.15.5', 'Connection': 'keep-alive', 'X-Archive-Orig-content-length': '1297', 'X-Archive-Orig-connection': 'close', 'X-Archive-Src': 'live-20181125223002-wwwb-app55.us.archive.org.warc.gz', 'X-Archive-Guessed-Content-Type': 
'text/plain', 'Cache-Control': 'max-age=1800', 'X-Archive-Orig-date': 'Sun, 25 Nov 2018 23:00:04 GMT', 'X-App-Server': 'wwwb-app56', 'Content-Type': 'text/plain;charset=UTF-8'}
Traceback (most recent call last):
  File "/usr/local/bin/savepagenow", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/savepagenow/api.py", line 127, in cli
    archive_url = capture(url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(response.headers)
savepagenow.api.WaybackRuntimeError: {'X-Archive-Orig-x-frame-options': 'SAMEORIGIN', 'X-Archive-Guessed-Encoding': 'utf-8', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'Transfer-Encoding': 'chunked', 'X-Archive-Orig-vary': 'Accept-Encoding', 'X-Archive-Orig-last-modified': 'Tue, 20 Nov 2018 02:18:34 GMT', 'X-Archive-Orig-cf-ray': '47f8129ffd9f9656-SJC', 'X-ts': '----', 'X-Archive-Orig-server': 'cloudflare', 'Link': '<https://my.liquidvpn.com/canary/canary>; rel="original", <http://web.archive.org/web/timemap/link/https://my.liquidvpn.com/canary/canary>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/https://my.liquidvpn.com/canary/canary>; rel="timegate", <http://web.archive.org/web/20150630191644/https://my.liquidvpn.com/canary/canary>; rel="first memento"; datetime="Tue, 30 Jun 2015 19:16:44 GMT", <http://web.archive.org/web/20181124000005/https://my.liquidvpn.com/canary/canary>; rel="prev memento"; datetime="Sat, 24 Nov 2018 00:00:05 GMT", <http://web.archive.org/web/20181126000005/https://my.liquidvpn.com/canary/canary>; rel="memento"; datetime="Mon, 26 Nov 2018 00:00:05 GMT", <http://web.archive.org/web/20181126000005/https://my.liquidvpn.com/canary/canary>; rel="last memento"; datetime="Mon, 26 Nov 2018 00:00:05 GMT"', 'Memento-Datetime': 'Mon, 26 Nov 2018 00:00:05 GMT', 'Date': 'Mon, 26 Nov 2018 00:00:05 GMT', 'X-Archive-Orig-x-turbo-charged-by': 'LiteSpeed', 'X-Page-Cache': 'MISS', 'X-Cache-Key': 'httpweb.archive.org/web/20181126000005/https://my.liquidvpn.com/canary/canaryUS', 'X-location': 'All', 'Server': 'nginx/1.15.5', 'Connection': 'keep-alive', 'X-Archive-Orig-expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'X-Archive-Orig-content-length': '3189', 'X-Archive-Orig-connection': 'close', 'X-Archive-Src': 
'live-20181125235907-wwwb-app15.us.archive.org.warc.gz', 'X-Archive-Orig-set-cookie': '__cfduid=d495fc50a5f727ee7640668d46955394f1543190405; expires=Tue, 26-Nov-19 00:00:05 GMT; path=/; domain=.liquidvpn.com; HttpOnly; Secure', 'X-Archive-Guessed-Content-Type': 'text/plain', 'Cache-Control': 'max-age=1800', 'X-Archive-Orig-date': 'Mon, 26 Nov 2018 00:00:05 GMT', 'X-App-Server': 'wwwb-app103', 'Content-Type': 'text/plain'}```
jwilk commented

Maybe there's something asynchronous going on and the archive URL is not available yet when the response is returned?

OTOH, the archive URL seems to be included in the Link header, so I guess you could extract it from there.
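A rough sketch of that idea, using only the standard library: parse the Link header and return the target whose `rel` is exactly `memento`. The function name `find_memento_url` is hypothetical, the sample header is shortened from the ipredator.se response above, and this is not necessarily how the package itself ended up handling it.

```python
import re

# One link-value: <target> followed by zero or more ;name="value" params.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+(?:="[^"]*"|=[^\s;,]+)?)*)')
PARAM_RE = re.compile(r';\s*([\w-]+)(?:=(?:"([^"]*)"|([^\s;,]+)))?')


def find_memento_url(link_header):
    """Return the URL whose rel is exactly "memento", or None if absent."""
    for target, params in LINK_RE.findall(link_header):
        for name, quoted, bare in PARAM_RE.findall(params):
            # Skip "first memento", "prev memento" and "last memento".
            if name.lower() == "rel" and (quoted or bare).split() == ["memento"]:
                return target
    return None


if __name__ == "__main__":
    # Shortened from the Link header posted earlier in this thread.
    sample = (
        '<https://www.ipredator.se/static/downloads/canary.txt>; rel="original", '
        '<http://web.archive.org/web/20181124180003/https://www.ipredator.se/static/downloads/canary.txt>; '
        'rel="prev memento"; datetime="Sat, 24 Nov 2018 18:00:03 GMT", '
        '<http://web.archive.org/web/20181125214711/https://www.ipredator.se/static/downloads/canary.txt>; '
        'rel="memento"; datetime="Sun, 25 Nov 2018 21:47:11 GMT"'
    )
    print(find_memento_url(sample))
```

Quoted parameter values are matched as a unit, so the commas inside `datetime="..."` do not split a link in two. A response that only carries combined relations such as `last memento` would return `None` here and would need a fallback.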

I think this was addressed by the patch in #25. The fix is, I hope, released here: https://pypi.org/project/savepagenow/1.0.0/