oduwsdl/CarbonDate

Reformat JSON

grantat opened this issue · 3 comments

Currently the JSON format is not uniform. Every key should follow a single convention, either camel case or underscores, not both. A sample JSON response in server mode:

{
  "self": "http://localhost:8888/cd?url=http://www.cnn.com/2017/07/04/politics/us-officials-meet-north-korea-missile-launch/index.html",
  "URI": "http://www.cnn.com/2017/07/04/politics/us-officials-meet-north-korea-missile-launch/index.html",
  "Estimated Creation Date": "2017-07-04T15:10:24",
  "Archives": {
    "Earliest": "2017-07-04T15:28:32",
    "By_Archive": [
      {
        "URI": "http://web.archive.org/web/20170704152832/http://www.cnn.com/2017/07/04/politics/us-officials-meet-north-korea-missile-launch/index.html",
        "memento_datetime": "2017-07-04T15:28:32",
        "pubdate": "2017-07-04T15:10:24"
      },
      {
        "URI": "//wayback.archive-it.org/all/20170704185254/http://www.cnn.com/2017/07/04/politics/us-officials-meet-north-korea-missile-launch/index.html",
        "memento_datetime": "2017-07-04T18:52:54",
        "pubdate": "2017-07-04T23:59:59"
      },
      {
        "URI": "http://archive.is/20170704205543/http://www.cnn.com/2017/07/04/politics/us-officials-meet-north-korea-missile-launch/index.html",
        "memento_datetime": "2017-07-04T20:55:43",
        "pubdate": "2017-07-04T20:55:43"
      }
    ]
  },
  "Backlinks": "2017-07-04T15:28:32",
  "Bing.com": "",
  "Bitly.com": "2017-07-04T15:10:25",
  "Google.com": "2017-07-04T23:59:59",
  "Last Modified": "",
  "Pubdate tag": "2017-07-04T15:10:24",
  "Twitter.com": "2017-07-04T19:28:34"
}

The "Estimated Creation Date" key, for example, can be renamed to "estimated_creation_date" using underscores.
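A minimal sketch of that renaming, assuming every run of non-alphanumeric characters (spaces, dots, underscores) counts as a word boundary. The helper name `to_snake_case` is hypothetical and not part of CarbonDate:

```python
import re


def to_snake_case(key):
    """Lowercase a key and join its words with underscores."""
    # Any run of characters other than letters and digits separates two words.
    words = re.findall(r"[A-Za-z0-9]+", key)
    return "_".join(word.lower() for word in words)


print(to_snake_case("Estimated Creation Date"))  # estimated_creation_date
print(to_snake_case("By_Archive"))  # by_archive
```

Note that this rule turns "Bing.com" into "bing_com", so service names would still need an explicit mapping if shorter keys such as "bing" are preferred.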

Another idea, discussed with @ibnesayeed, is to group the services, leaving the first three keys ("self", "uri", and "estimated_creation_date") outside of this group. The earliest archive date could be removed, since estimated_creation_date is the overall earliest date that we're concerned with. It could look as follows:

{
  "self": "",
  "uri": "",
  "estimated_creation_date": "",
  "services": {
    "backlinks": "",
    "bing": "",
    "bitly": "",
    "google": "",
    "last_modified": "",
    "pubdate": "",
    "twitter": "",
    "archives": [
        {
          "uri": "",
          "memento_datetime": "",
          "pubdate": ""
        }
      ]
  }
}
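As a rough illustration, the current response could be regrouped into this shape along the following lines. The `SERVICE_KEYS` table and the `regroup` function are assumptions made for this sketch, not existing CarbonDate code:

```python
# Hypothetical mapping from the current top-level keys to the proposed service names.
SERVICE_KEYS = {
    "Backlinks": "backlinks",
    "Bing.com": "bing",
    "Bitly.com": "bitly",
    "Google.com": "google",
    "Last Modified": "last_modified",
    "Pubdate tag": "pubdate",
    "Twitter.com": "twitter",
}


def regroup(legacy):
    """Convert the current response layout into the proposed grouped layout."""
    # Missing services become empty strings, matching the empty values in the sample.
    services = {new: legacy.get(old, "") for old, new in SERVICE_KEYS.items()}
    # The "Earliest" archive date is dropped; only the per-archive entries are kept.
    services["archives"] = [
        {
            "uri": entry.get("URI", ""),
            "memento_datetime": entry.get("memento_datetime", ""),
            "pubdate": entry.get("pubdate", ""),
        }
        for entry in legacy.get("Archives", {}).get("By_Archive", [])
    ]
    return {
        "self": legacy.get("self", ""),
        "uri": legacy.get("URI", ""),
        "estimated_creation_date": legacy.get("Estimated Creation Date", ""),
        "services": services,
    }
```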

I was going to say that there is some utility in keeping the earliest archival date, but I guess it's not necessary.

I wonder if we should add a note saying which service "won"? In the example above, all the services have variations of "2017-07-04" and it takes a sec to scan and see that pubdate "wins".

If I were to organize it, I would put it something like this:

{
	"self": "http://example.com/cd?url=http://www.cnn.com/",
	"uri": "http://www.cnn.com/",
	"estimated-creation-date": "2017-07-04T15:10:24",
	"earliest-sources": ["pubdate", "web.archive.org"],
	"sources": {
		"backlinks": {
			"earliest": "2017-07-04T15:28:32"
		},
		"last-modified": {
			"earliest": ""
		},
		"pubdate": {
			"earliest": "2017-07-04T15:10:24"
		},
		"bing.com": {
			"earliest": ""
		},
		"bitly.com": {
			"earliest": "2017-07-04T15:10:25"
		},
		"google.com": {
			"earliest": "2017-07-04T23:59:59"
		},
		"twitter.com": {
			"earliest": "2017-07-04T19:28:34"
		},
		"web.archive.org": {
			"uri-m": "http://web.archive.org/web/20170704152832/http://www.cnn.com/",
			"memento-datetime": "2017-07-04T15:28:32",
			"pubdate": "2017-07-04T15:10:24",
			"earliest": "2017-07-04T15:10:24"
		},
		"wayback.archive-it.org": {
			"uri-m": "//wayback.archive-it.org/all/20170704185254/http://www.cnn.com/",
			"memento-datetime": "2017-07-04T18:52:54",
			"pubdate": "2017-07-04T23:59:59",
			"earliest": "2017-07-04T18:52:54"
		},
		"archive.is": {
			"uri-m": "http://archive.is/20170704205543/http://www.cnn.com/",
			"memento-datetime": "2017-07-04T20:55:43",
			"pubdate": "2017-07-04T20:55:43",
			"earliest": "2017-07-04T20:55:43"
		}
	}
}

I have put only the summary and meta information at the top level. Everything else goes into the sources block, which is uniformly formatted: each entry contains a mandatory earliest attribute, even if it is empty. Each source block can have additional attributes, such as title, or the attributes shown in the archival sources like uri-m or memento-datetime (note that these still have the earliest attribute in them). The naming convention is to keep every attribute name in lowercase letters and use a hyphen - to join words. I have also added a top-level earliest-sources attribute in response to @phonedude's idea, but I kept the value in an array because there might be cases where more than one source points to the same creation date. So it is better to stay consistent and always use an array, even if there is only one winning source. The array contains the keys of the corresponding entries in the sources block.
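Because every source carries an earliest attribute, the two summary fields can be derived from the sources block alone. A minimal sketch, assuming all timestamps use the same ISO 8601 format and timezone so that they compare correctly as strings; the function name `summarize` is hypothetical:

```python
def summarize(sources):
    """Return (estimated creation date, list of winning source keys)."""
    # Skip sources whose mandatory "earliest" attribute is empty.
    dated = {
        name: block["earliest"]
        for name, block in sources.items()
        if block.get("earliest")
    }
    if not dated:
        return "", []
    # Same-format ISO 8601 strings sort chronologically, so min() finds the earliest.
    creation_date = min(dated.values())
    # More than one source may share the earliest date, hence a list.
    winners = [name for name, value in dated.items() if value == creation_date]
    return creation_date, winners
```

Applied to the sources in the example, this would yield "2017-07-04T15:10:24" with "pubdate" and "web.archive.org" as the winners.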

With this organization and these conventions, it is easy and predictable for clients to consume the service. It also streamlines the kind of response expected from each module as more sources are added in the future. Just by looking at the uri, estimated-creation-date, and earliest-sources attributes, an application can state something like this very easily:

Based on {earliest-sources}, {uri} first came to life at {estimated-creation-date}.
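A client could fill in that sentence roughly as follows. This is only a sketch against the proposed layout; `describe` is a hypothetical name, and joining the sources with "and" is one possible choice:

```python
def describe(response):
    """Render the one-line summary from the three top-level attributes."""
    # earliest-sources is always a list, even with a single winner.
    sources = " and ".join(response["earliest-sources"])
    return (
        f"Based on {sources}, {response['uri']} "
        f"first came to life at {response['estimated-creation-date']}."
    )


example = {
    "uri": "http://www.cnn.com/",
    "estimated-creation-date": "2017-07-04T15:10:24",
    "earliest-sources": ["pubdate", "web.archive.org"],
}
print(describe(example))
```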

This was resolved in #8, gonna go ahead and close it.