carbon date truncates arguments with "&" in them
phonedude opened this issue · 5 comments
http://carbondate.cs.odu.edu/cd?url=www.cs.odu.edu/foo.cgi&arg1=1&arg2=2
produces:
{
"self": "http://carbondate.cs.odu.edu/cd?url=www.cs.odu.edu/foo.cgi&arg1=1&arg2=2",
"uri": "http://www.cs.odu.edu/foo.cgi",
"estimated-creation-date": "2006-09-13T19:18:54",
...
}
I see what's happening: it's treating the arg1 and arg2 parameters as part of the carbondate.cs.odu.edu request rather than as part of the URI specified.
The parameters can make a difference in finding mementos for something like that URI:
http://web.archive.org/web/*/www.cs.odu.edu/foo.cgi = 1 memento
http://web.archive.org/web/*/www.cs.odu.edu/foo.cgi&arg1=1&arg2=2 = 0 mementos
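The truncation can be reproduced with nothing but the Python standard library. This is only a sketch of how a generic query-string parser splits the request, not the actual carbon date code. The variable names are made up for illustration, and percent-encoding is shown as one possible client-side workaround, not as the fix chosen in this thread.

```python
from urllib.parse import parse_qs, quote, urlsplit

# The request from the report: the "&" characters end the "url" value,
# so arg1 and arg2 become parameters of the carbon date request itself.
request = "http://carbondate.cs.odu.edu/cd?url=www.cs.odu.edu/foo.cgi&arg1=1&arg2=2"
params = parse_qs(urlsplit(request).query)
print(params["url"])   # only the part before the first "&"
print(sorted(params))  # "arg1" and "arg2" show up as separate keys

# Workaround sketch: percent-encode the whole target before sending it.
target = "www.cs.odu.edu/foo.cgi&arg1=1&arg2=2"
encoded_request = "http://carbondate.cs.odu.edu/cd?url=" + quote(target, safe="")
encoded_params = parse_qs(urlsplit(encoded_request).query)
print(encoded_params["url"])  # the full target survives the round trip
```

With the encoded form the server sees a single "url" parameter, but every client has to remember to encode, which is the burden the route-based proposal below tries to avoid.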
However, for something like youtube.com we definitely need those parameters.
For example, http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch&v=Tnf_Brn-zdA
is reduced to www.youtube.com/watch, which redirects to www.youtube.com,
and that clearly isn't the video we want. We're looking for http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch?v=Tnf_Brn-zdA.
To correct this, I think I'll remove the "url=" query parameter and create a route such as "/cd/". Open to other suggestions as well.
If I remember correctly, when we were discussing the output JSON structure, I also mentioned that this should be brought in line with how other archiving-related services work. They take the URI as the last path parameter, after every significant path prefix in the route. This eliminates the need for explicit URL encoding.
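A minimal sketch of that idea, assuming the server can see the raw request target (path plus query string) exactly as it arrived on the HTTP request line. The function name `target_from_request` and the default prefix are hypothetical, and no particular web framework is implied.

```python
def target_from_request(raw_request_uri: str, prefix: str = "/cd/") -> str:
    """Return everything after the route prefix, query string included.

    raw_request_uri is assumed to be the unparsed request target,
    e.g. "/cd/www.youtube.com/watch?v=Tnf_Brn-zdA".
    """
    if not raw_request_uri.startswith(prefix):
        raise ValueError("request does not match the route prefix")
    # Slice instead of parsing, so "?" and "&" in the target are left alone.
    return raw_request_uri[len(prefix):]


print(target_from_request("/cd/www.youtube.com/watch?v=Tnf_Brn-zdA"))
print(target_from_request("/cd/www.cs.odu.edu/foo.cgi&arg1=1&arg2=2"))
```

Because nothing after the prefix is interpreted, the service has no parameters of its own to confuse with those of the target. If options are ever needed, they would have to go into the path prefix rather than the query string.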
thanks guys. yes, a structure like:
http://carbondate.cs.odu.edu/cd/www.youtube.com/watch&v=Tnf_Brn-zdA
would be better.
Hey @HanySalahEldeen, it's great to hear from you. Hope you are doing good.
Correct me if I am wrong, but isn't that the desired behavior? To strip the
parameters from the URL and find the source?
I think non-significant parameters, the protocol, and the subdomain are removed as part of canonicalization. Most web archives do this, but we can canonicalize on our end too, so that non-archival sources benefit from it as well. In this report, however, the URL parameters were dropped unintentionally, which is a bug.