still having problems with arguments
phonedude opened this issue · 16 comments
This is Martin's test URI: https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388
See: http://carbondate.cs.odu.edu/cd/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388
and
http://carbondate.cs.odu.edu/#https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388
the json response is:
{
"self": "http://carbondate.cs.odu.edu/cd/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"uri": "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"estimated-creation-date": "2004-07-31T23:59:59",
"earliest-sources": [
"google.com"
],
"sources": {
"pubdate": {
"earliest": ""
},
"last-modified": {
"earliest": ""
},
"bitly.com": {
"earliest": "2016-08-25T12:01:11"
},
"google.com": {
"earliest": "2004-07-31T23:59:59"
},
"backlinks": {
"earliest": ""
},
"twitter.com": {
"earliest": ""
},
"bing.com": {
"earliest": ""
},
"web.archive.org": {
"uri-m": "http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&",
"memento-datetime": "2017-08-30T19:37:41",
"memento-pubdate": "",
"earliest": "2017-08-30T19:37:41"
}
}
}
note how the second argument ("GRid=166635388") is left off the URI-M from web.archive.org. Also, I doubt the google.com date is correct (judging by the content of the page) -- can we establish that CD sent the correct URI to google.com?
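A quick way to spot this kind of truncation in a response is to compare the requested URI with the tail of the URI-M. The sketch below is only a heuristic under that assumption (archives may canonicalize URIs, so a mismatch is a hint rather than proof), and the helper name uri_m_is_truncated is made up for illustration; the data is copied from the response above.

```python
def uri_m_is_truncated(response: dict) -> bool:
    """Return True when a web.archive.org URI-M exists but does not end with the requested URI.

    Hypothetical helper, not part of Carbon Date.
    """
    uri = response["uri"]
    uri_m = response["sources"].get("web.archive.org", {}).get("uri-m", "")
    return bool(uri_m) and not uri_m.endswith(uri)


# Trimmed copy of the JSON response shown above.
response = {
    "uri": "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
    "sources": {
        "web.archive.org": {
            "uri-m": "http://web.archive.org/web/20170830193741/"
                     "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&",
        }
    },
}

print(uri_m_is_truncated(response))
```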
So the URI-M for the web.archive.org resource is taken directly from the response provided by Memgator. The request is:
curl "http://memgator.cs.odu.edu/timemap/json/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
Which returns:
{
"original_uri": "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"self": "http://memgator.cs.odu.edu/timemap/json/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"mementos": {
"list": [
{
"datetime": "2017-08-30T19:37:41Z",
"uri": "http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&"
}
],
"first": {
"datetime": "2017-08-30T19:37:41Z",
"uri": "http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&"
},
"last": {
"datetime": "2017-08-30T19:37:41Z",
"uri": "http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&"
}
},
"timemap_uri": {
"link_format": "http://memgator.cs.odu.edu/timemap/link/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"json_format": "http://memgator.cs.odu.edu/timemap/json/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"cdxj_format": "http://memgator.cs.odu.edu/timemap/cdxj/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
},
"timegate_uri": "http://memgator.cs.odu.edu/timegate/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
}
Tracing this back to the endpoint Memgator queries gives the following request:
curl "http://web.archive.org/web/timemap/link/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
Which returns:
<https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388>; rel="original",
<http://web.archive.org/web/timemap/link/https://www.findagrave.com/cgi-bin/fg.cgi>; rel="self"; type="application/link-format"; from="Wed, 30 Aug 2017 19:37:41 GMT">,
<http://web.archive.org>; rel="timegate",
<http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388>; rel="first memento"; datetime="Wed, 30 Aug 2017 19:37:41 GMT",
So it seems Memgator is parsing link format incorrectly. @ibnesayeed can you check this out?
Let me look at the Google module and get back to you.
Yes, I can confirm that this is happening in MemGator. The reason is HTML encoding (rather than URL encoding) of the URI from IA. They return "?page=gr&amp;GRid=166635388", where "&" is converted to "&amp;". This unnecessary encoding inserts a semicolon in the URI (which is a reserved character). In Link format, the semicolon is used as the attribute separator, so the parser treats everything after it as attributes. The link parser I wrote in MemGator had some performance optimizations in mind because it has to parse really long link-formatted data. I can fix it in MemGator to not clip the URI when it sees a semicolon, but the response will still contain HTML-encoded URIs. The issue should really be fixed by the @internetarchive.
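The clipping described above can be reproduced in a few lines. This is a minimal Python sketch, not MemGator's actual parser (which is not shown in this thread): naive_target stands in for any parser that cuts at the first semicolon, and bracket_target is one possible workaround that reads up to the closing angle bracket and then undoes the HTML encoding. Both function names are hypothetical.

```python
import html

# One link-format entry with the HTML-encoded ampersand described above.
entry = (
    "<http://web.archive.org/web/20170830193741/"
    "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&amp;GRid=166635388>; "
    'rel="first memento"; datetime="Wed, 30 Aug 2017 19:37:41 GMT"'
)


def naive_target(link_value: str) -> str:
    """Cut at the first semicolon, assuming it always starts the attributes."""
    return link_value.split(";", 1)[0].strip().strip("<>")


def bracket_target(link_value: str) -> str:
    """Take the text between '<' and the next '>' and decode HTML entities."""
    start = link_value.index("<") + 1
    end = link_value.index(">", start)
    return html.unescape(link_value[start:end])


print(naive_target(entry))    # clipped inside the "&amp;" entity
print(bracket_target(entry))  # full URI with a plain "&"
```

Decoding entities on the client side is only a workaround: a query parameter whose name happens to look like an entity name could be decoded by mistake, which is one more reason the encoding is better fixed at the source.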
yes, this is banged up pretty badly:
$ curl -i "http://web.archive.org/web/timemap/link/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Wed, 25 Oct 2017 13:07:31 GMT
Content-Type: application/link-format
Transfer-Encoding: chunked
Connection: keep-alive
X-App-Server: wwwb-app43
X-ts: ----
X-Archive-Playback: 0
X-location: All
X-Page-Cache: MISS
<https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388>; rel="original",
<http://web.archive.org/web/timemap/link/https://www.findagrave.com/cgi-bin/fg.cgi>; rel="self"; type="application/link-format"; from="Wed, 30 Aug 2017 19:37:41 GMT">,
<http://web.archive.org>; rel="timegate",
<http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388>; rel="first memento"; datetime="Wed, 30 Aug 2017 19:37:41 GMT",
@internetarchive doesn't seem to have a GitHub repo where we can report such bugs. Should we post it on the MementoDev mailing list?
Hi @grantat
What I would ultimately like to do is run Carbondate as a web service and tell it not to do the archive lookup or the backlinks magic, and maybe also skip google (due to the flaw from above). I don't think I can do this right now with the web service, but I can at least skip the backlinks in the docker version. Is there also an option to skip the archive lookup?
I tested the docker version and
$ docker run --rm -it oduwsdl/carbondate ./main.py -l search http://cs.odu.edu
seems to work okay but
$ docker run --rm -it oduwsdl/carbondate ./main.py -l search https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388
does not return anything. I tried escaping the URI as well as putting it in single and double quotes.
Any ideas?
Hi @martinklein0815, while @grantat might give you a better way of achieving what you are trying to do, in the meantime you may want to edit the config file, especially the SystemUtility field, to list only the modules you want to collect data from. Then run your server and talk to it.
@martinklein0815 you can exclude multiple modules by listing their filenames with the -e parameter. For example:
$ docker run --rm -it oduwsdl/carbondate ./main.py -l search http://example.org/index.html -e cdGetBacklinks cdGetArchives
excludes backlinks and archive lookups.
You're correct that the server version does not currently have this feature but I could probably add it. I should also update the docs with all the possible exclusions.
Reading a bit deeper on the issue with using the local version for a URI with parameters, I found that this is a shell issue because the shell thinks & means "send this process to the background". That's very unfortunate because it means a user will have to escape each of a URI's ampersands by doing \&. So here is a working example of the URI @martinklein0815 had issues with:
$ docker run --rm -it oduwsdl/carbondate ./main.py -l search "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr\&GRid=166635388" -e cdGetBacklinks cdGetGoogle
with the google and backlink modules removed.
I can't really think of a way to solve this in python. The only way I see it working is to write a bash script that escapes the ampersands and then passes the escaped URI to the main script.
@grantat: Reading a bit deeper on the issue with using the local version for a URI with parameters, I found that this is an OS issue because the OS thinks & means send this process to the background. That's very unfortunate because that means a user will have to escape each of a URI's ampersands by doing \&.
This is not really an issue. An easier way to use & in URIs on the terminal without escaping is to quote the URI. We do it all the time in curl, for example:
$ curl "http://example.com/?foo=bar&baz=blah"
You're right, I found the suspect in main.py.
It was executing the local version inside python and there was no quoting on the URI. Fix being merged soon with some examples for the README.
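The effect of the missing quoting can be illustrated with Python's standard shlex module. This is a sketch of the general problem, not the code in main.py or the fix merged for it: it tokenizes a command line roughly the way a POSIX shell would, first with the URI pasted in unquoted and then with shlex.quote applied. The shell_like_tokens helper is hypothetical, and it needs Python 3.8 or later for whitespace_split to work together with punctuation_chars.

```python
import shlex

uri = "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"


def shell_like_tokens(command: str) -> list:
    """Tokenize roughly like a POSIX shell, keeping '&' as a control operator."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


unquoted = f"./main.py -l search {uri}"
quoted = f"./main.py -l search {shlex.quote(uri)}"

print(shell_like_tokens(unquoted))  # the URI is split at '&'
print(shell_like_tokens(quoted))    # the URI survives as a single argument
```

When the command is launched from Python, passing the arguments as a list to subprocess.run (without shell=True) avoids the shell entirely, so no quoting is needed at all.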
Thanks for the help re excluding individual modules, @grantat
Usually, at least with the CLI tools I am using, you either escape special chars or put the string/URI in double quotes. I did not think of doing both.
Please note that there is a discrepancy between the results returned by the docker version and the web service. The Bitly result does not show up in the docker version:
$ docker run --rm -it oduwsdl/carbondate ./main.py -l search "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr\&GRid=166635388" -e cdGetBacklinks cdGetArchives cdGetGoogle
{
"self": "",
"uri": "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"estimated-creation-date": "",
"earliest-sources": [],
"sources": {
"bing.com": {
"earliest": ""
},
"last-modified": {
"earliest": ""
},
"bitly.com": {
"earliest": ""
},
"twitter.com": {
"earliest": ""
},
"pubdate": {
"earliest": ""
}
}
}
vs
$ curl "http://carbondate.cs.odu.edu/cd/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
{
"self": "http://carbondate.cs.odu.edu/cd/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"uri": "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"estimated-creation-date": "2004-07-31T23:59:59",
"earliest-sources": [
"google.com"
],
"sources": {
"pubdate": {
"earliest": ""
},
"last-modified": {
"earliest": ""
},
"bitly.com": {
"earliest": "2016-08-25T12:01:11"
},
"google.com": {
"earliest": "2004-07-31T23:59:59"
},
"backlinks": {
"earliest": ""
},
"twitter.com": {
"earliest": ""
},
"bing.com": {
"earliest": ""
},
"web.archive.org": {
"uri-m": "http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&",
"memento-datetime": "2017-08-30T19:37:41",
"memento-pubdate": "",
"earliest": "2017-08-30T19:37:41"
}
}
}
@martinklein0815: Usually, at least with the CLI tools I am using, you either escape special chars or put the string/URI in double quotes. I did not think of doing both.
That part was a surprise to me too. I hope it will be fixed once #17 is merged. I did talk to @grantat about streamlining the CLI as the way it works currently is not how it should be.
@martinklein0815 That's because the docker image doesn't provide working API keys for either Bitly or Bing by default. There's no simple way to change the keys in the image we have right now either. There is an option to write the keys in "dev" mode locally, but we've shown how to execute the docker image in a one-off way. Like @ibnesayeed said, the CLI definitely needs some improvement, and adding keys for Bitly/Bing should be an argument in the local version without dev mode.
@grantat all the API keys can be read either from the config file or, for simplicity, from well-defined environment variables. The image and the codebase should generally never have such keys hard-coded.
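As a rough sketch of that suggestion, a lookup could prefer an environment variable and fall back to the config file. The variable name CD_BITLY_TOKEN, the config layout, and the resolve_key helper below are all hypothetical; the thread does not say which names Carbon Date uses.

```python
import os


def resolve_key(name, config, environ=None):
    """Return the key from the environment, else from the config, else None.

    Hypothetical helper: neither the name nor the lookup order comes from Carbon Date.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name) or config.get(name) or None


# Placeholder config contents; a real config file would be loaded from disk.
config = {"CD_BITLY_TOKEN": "token-from-config"}

print(resolve_key("CD_BITLY_TOKEN", config, {"CD_BITLY_TOKEN": "token-from-env"}))
print(resolve_key("CD_BITLY_TOKEN", config, {}))
```

With docker, such a variable could then be supplied at run time through the -e option of docker run, so no key has to be baked into the image.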
Issue should be fixed now with #17. Updated Readme with examples for using docker environment variables and disabling Carbon Date modules. Also added URL encoding for the google module so it actually searches for URLs with parameters.
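To see why URL encoding matters when a URI with parameters is embedded in another URI's query string, consider the sketch below. It is not the code from #17; the search endpoint and the q parameter are assumptions used only to illustrate the idea with the standard urllib.parse functions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

target = "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
SEARCH_BASE = "https://www.google.com/search"  # assumed endpoint, for illustration only

# Pasting the target in as-is lets its "&" end the q parameter early.
raw_query = f"{SEARCH_BASE}?q={target}"

# Percent-encoding the target keeps it inside the q parameter.
encoded_query = f"{SEARCH_BASE}?{urlencode({'q': target})}"

raw_params = parse_qs(urlsplit(raw_query).query)
encoded_params = parse_qs(urlsplit(encoded_query).query)

print(raw_params)      # GRid shows up as a separate parameter
print(encoded_params)  # q holds the complete target URI
```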
@martinklein0815 you can pull the latest docker image and should be able to run the following without needing to escape the ampersand, but you still need to quote the URL:
$ docker run --rm -it oduwsdl/carbondate ./main.py -l search "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388"
{
"self": "",
"uri": "https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=166635388",
"estimated-creation-date": "2017-08-30T19:37:41",
"earliest-sources": [
"web.archive.org"
],
"sources": {
"web.archive.org": {
"uri-m": "http://web.archive.org/web/20170830193741/https://www.findagrave.com/cgi-bin/fg.cgi?page=gr&",
"memento-datetime": "2017-08-30T19:37:41",
"memento-pubdate": "",
"earliest": "2017-08-30T19:37:41"
},
"twitter.com": {
"earliest": ""
},
"bitly.com": {
"earliest": ""
},
"backlinks": {
"earliest": ""
},
"bing.com": {
"earliest": ""
},
"last-modified": {
"earliest": ""
},
"google.com": {
"earliest": ""
},
"pubdate": {
"earliest": ""
}
}
}