pkiraly/metadata-qa-api

JSON output not as expected

Closed this issue · 4 comments

Hi @pkiraly !

The README says the JSON output records of a basic completeness test should look like this:

{
  "completeness":{
    "completeness":{
      "TOTAL":0.35294117647058826,
      "MANDATORY":1.0
    },
    "existence":{
      "url":true,
      "name":true,
      ...
    },
    "cardinality":{
      "url":1,
      "name":1,
      ...
    }
  }
}

(btw, why is the completeness key repeated? There's no need)

However, running calculator.measureAsJson(record); gives me something like

{
  "completeness":{
      "url":1,
      "name":1,
      ...
    }
  }
}

This doesn't seem right?

@mielvds

  1. The "completeness" class can measure three things: completeness, existence and cardinality. You can configure whether you want all three or only one. But you are right, maybe we do not need this outer "completeness" in the output.

  2. I will check it. It doesn't seem logical, nor does it conform to the documentation. Could you provide the measurement configuration? (either here or via email)

Thanks!

Will add a sample tomorrow. Btw, the CSV output comes out fine

Hi @pkiraly,

I tried reproducing the issue, but somehow the output comes out okay now! Maybe something in your new release, or I messed up somewhere.

{"fieldExtractor":{"fieldExtractor":{"recordId":"0000008c000a4b2cb90eefb7b131289d728fc57cc25946c2aca6ccb0820857da69f1ef620c2b4c99a668cb38062bf45c"}},"completeness":{"completeness":{"TOTAL":0.875,"mandatory_if_present":0.875},"existence":{"recordId":true,"fragment_title":true,"fragment_description":true,"MDProperties.PID":true,"MDProperties.dc_identifier_localid":true,"MDProperties.CP_id":true,"MDProperties.dc_title":true,"MDProperties.dc_description_lang":false},"cardinality":{"recordId":1,"fragment_title":1,"fragment_description":1,"MDProperties.PID":1,"MDProperties.dc_identifier_localid":1,"MDProperties.CP_id":1,"MDProperties.dc_title":1,"MDProperties.dc_description_lang":0}}}
{"fieldExtractor":{"fieldExtractor":{"recordId":"000004e4c76848c9b1beba651a6b4c33a85185941e9e4e078531f371822cae14f2fc0efaca9f4fd5a95983f1382ac39d"}},"completeness":{"completeness":{"TOTAL":0.875,"mandatory_if_present":0.875},"existence":{"recordId":true,"fragment_title":true,"fragment_description":true,"MDProperties.PID":true,"MDProperties.dc_identifier_localid":true,"MDProperties.CP_id":true,"MDProperties.dc_title":true,"MDProperties.dc_description_lang":false},"cardinality":{"recordId":1,"fragment_title":1,"fragment_description":1,"MDProperties.PID":1,"MDProperties.dc_identifier_localid":1,"MDProperties.CP_id":1,"MDProperties.dc_title":1,"MDProperties.dc_description_lang":0}}}
{"fieldExtractor":{"fieldExtractor":{"recordId":"0000b533920a4ac4bcda403c2e037a0ef2ec8401de0448e1aa0972d6f08f86257ee49ef1777245179b94de27ef58974f"}},"completeness":{"completeness":{"TOTAL":1.0,"mandatory_if_present":1.0},"existence":{"recordId":true,"fragment_title":true,"fragment_description":true,"MDProperties.PID":true,"MDProperties.dc_identifier_localid":true,"MDProperties.CP_id":true,"MDProperties.dc_title":true,"MDProperties.dc_description_lang":true},"cardinality":{"recordId":1,"fragment_title":1,"fragment_description":1,"MDProperties.PID":1,"MDProperties.dc_identifier_localid":1,"MDProperties.CP_id":1,"MDProperties.dc_title":1,"MDProperties.dc_description_lang":1}}}
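
For readers consuming this output, here is a minimal Python sketch of how the nested keys can be read: the outer "completeness" names the calculator, the inner keys name the three groups it reports. The function name `summarize` is made up for illustration and is not part of metadata-qa-api. The sketch also compares TOTAL with the share of fields whose existence is true; the two coincide in the records above, but whether the library computes TOTAL exactly this way is an assumption.

```python
import json

# First output record from this thread, copied verbatim (split for readability).
LINE = (
    '{"fieldExtractor":{"fieldExtractor":{"recordId":'
    '"0000008c000a4b2cb90eefb7b131289d728fc57cc25946c2aca6ccb0820857da69f1ef620c2b4c99a668cb38062bf45c"}},'
    '"completeness":{"completeness":{"TOTAL":0.875,"mandatory_if_present":0.875},'
    '"existence":{"recordId":true,"fragment_title":true,"fragment_description":true,'
    '"MDProperties.PID":true,"MDProperties.dc_identifier_localid":true,'
    '"MDProperties.CP_id":true,"MDProperties.dc_title":true,'
    '"MDProperties.dc_description_lang":false},'
    '"cardinality":{"recordId":1,"fragment_title":1,"fragment_description":1,'
    '"MDProperties.PID":1,"MDProperties.dc_identifier_localid":1,'
    '"MDProperties.CP_id":1,"MDProperties.dc_title":1,'
    '"MDProperties.dc_description_lang":0}}}'
)


def summarize(line):
    """Hypothetical helper: pull the score and the missing fields out of one line."""
    record = json.loads(line)
    block = record["completeness"]      # outer key: the calculator
    scores = block["completeness"]      # inner key: the score group
    existence = block["existence"]      # field name -> True/False
    missing = [name for name, present in existence.items() if not present]
    # True counts as 1, so this is the fraction of fields that are present.
    share_present = sum(existence.values()) / len(existence)
    return {
        "total": scores["TOTAL"],
        "share_present": share_present,
        "missing": missing,
    }


summary = summarize(LINE)
print(summary)
```

For this record, 7 of the 8 listed fields are present, which matches the reported TOTAL of 0.875.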

About the repetition of field names:
It's a good idea to separate metadata like the fieldExtractor and measurements like completeness, existence and cardinality. My suggestion would be to simply group them in these two categories. One output record would then be

{
  "metadata": {
    "fieldExtractor": { "recordId":"000004e4c76848c9b1beba651a6b4c33a85185941e9e4e078531f371822cae14f2fc0efaca9f4fd5a95983f1382ac39d" }
  },
  "results": {
    "completeness": {...},
    "existence": {...},
    "cardinality": {...}
  }
}
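
Until the output format is revised, the proposed grouping can be approximated on the consumer side. The following Python sketch is only an illustration of the suggestion above: the function `regroup` and its `metadata_keys` parameter are hypothetical, and the sample record is an abbreviated copy of the second output line (two of the eight fields kept).

```python
import json

# Abbreviated copy of the second output record from this thread.
record = json.loads(
    '{"fieldExtractor":{"fieldExtractor":{"recordId":'
    '"000004e4c76848c9b1beba651a6b4c33a85185941e9e4e078531f371822cae14f2fc0efaca9f4fd5a95983f1382ac39d"}},'
    '"completeness":{"completeness":{"TOTAL":0.875,"mandatory_if_present":0.875},'
    '"existence":{"recordId":true,"MDProperties.dc_description_lang":false},'
    '"cardinality":{"recordId":1,"MDProperties.dc_description_lang":0}}}'
)


def regroup(record, metadata_keys=("fieldExtractor",)):
    """Split the per-calculator blocks into 'metadata' and 'results'."""
    metadata, results = {}, {}
    for calculator, groups in record.items():
        if calculator in metadata_keys:
            # Unwrap the repeated key: {"fieldExtractor": {"fieldExtractor": {...}}}
            metadata[calculator] = groups.get(calculator, groups)
        else:
            # Lift completeness / existence / cardinality one level up.
            # Caveat: two calculators emitting the same group name would collide.
            results.update(groups)
    return {"metadata": metadata, "results": results}


regrouped = regroup(record)
print(json.dumps(regrouped, indent=2))
```

The repeated "completeness" and "fieldExtractor" keys disappear in the regrouped record, at the cost of losing the information about which calculator produced each group.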

For future reference:

Data:

{"recordId": "0000008c000a4b2cb90eefb7b131289d728fc57cc25946c2aca6ccb0820857da69f1ef620c2b4c99a668cb38062bf45c", "MDProperties.sp_name": "w1", "MDProperties.CP": "orgX", "MDProperties.PID": "2b8vb0bn5z", "MDProperties.dc_identifier_localid": "BR0000035282129", "MDProperties.CP_id": "OR-xxxxx", "MDProperties.dc_title": "BR0000035282129.tif"}
{"recordId": "000004e4c76848c9b1beba651a6b4c33a85185941e9e4e078531f371822cae14f2fc0efaca9f4fd5a95983f1382ac39d", "MDProperties.sp_name": "w1", "MDProperties.CP": "orgX", "MDProperties.PID": "ns0ks8px51", "MDProperties.dc_identifier_localid": "BR0000036392872", "MDProperties.CP_id": "OR-xxxxx", "MDProperties.dc_title": "BR0000036392872.tif"}
{"recordId": "0000b533920a4ac4bcda403c2e037a0ef2ec8401de0448e1aa0972d6f08f86257ee49ef1777245179b94de27ef58974f", "MDProperties.sp_name": "w2", "MDProperties.dcterms_issued": "20201-03-16", "MDProperties.dc_identifier_localid": "xxxxxxxxx", "MDProperties.CP_id": "OR-xxxxx", "MDProperties.dc_description_lang": "kjhewflwjfiowejfgporwf", "MDProperties.CP": "xxx", "MDProperties.PID": "xxxxxxx", "MDProperties.dc_title": "Some title", "MDProperties.dc_subjects.Trefwoord": ["KEYWORD"], "MDProperties.dc_contributors.Reporter": "Reporter", "MDProperties.dc_titles.programma": "Prgramma", "MDProperties.dc_titles.serie": "Serie X", "MDProperties.dcterms_created": "2000-04-16", "MDProperties.dc_identifier_localids.MEDIA_ID": "15d2e63f"}

Schema:

fields:
  - categories:
      - mandatory_if_present
    name: recordId
    path: $.['recordId']
    extractable: true
  - categories:
      - mandatory_if_present
    name: fragment_title
    path: $.['MDProperties.sp_name']
  - categories:
      - mandatory_if_present
    name: fragment_description
    path: $.['MDProperties.CP']
  - categories:
      - mandatory_if_present
    name: MDProperties.PID
    path: $.['MDProperties.PID']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_identifier_localid
    path: $.['MDProperties.dc_identifier_localid']
  - categories:
      - mandatory_if_present
    name: MDProperties.CP_id
    path: $.['MDProperties.CP_id']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_title
    path: $.['MDProperties.dc_title']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_description_lang
    path: $.['MDProperties.dc_description_lang']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_description_lang
    path: $.['MDProperties.dc_subjects.Trefwoord']
format: json
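
A side observation on this schema: it defines nine fields, while the output above lists eight, and the name MDProperties.dc_description_lang appears twice (once with the dc_subjects.Trefwoord path). How the library treats a repeated name is not stated in this thread, so the following Python sketch only detects the repetition. It scans the schema text with a regular expression instead of a YAML parser to stay within the standard library.

```python
import re
from collections import Counter

# The schema from this thread, copied verbatim.
SCHEMA = """\
fields:
  - categories:
      - mandatory_if_present
    name: recordId
    path: $.['recordId']
    extractable: true
  - categories:
      - mandatory_if_present
    name: fragment_title
    path: $.['MDProperties.sp_name']
  - categories:
      - mandatory_if_present
    name: fragment_description
    path: $.['MDProperties.CP']
  - categories:
      - mandatory_if_present
    name: MDProperties.PID
    path: $.['MDProperties.PID']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_identifier_localid
    path: $.['MDProperties.dc_identifier_localid']
  - categories:
      - mandatory_if_present
    name: MDProperties.CP_id
    path: $.['MDProperties.CP_id']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_title
    path: $.['MDProperties.dc_title']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_description_lang
    path: $.['MDProperties.dc_description_lang']
  - categories:
      - mandatory_if_present
    name: MDProperties.dc_description_lang
    path: $.['MDProperties.dc_subjects.Trefwoord']
format: json
"""

# Collect every "name: <value>" line, in order of appearance.
names = re.findall(r"^[ \t]*name:[ \t]*(\S+)[ \t]*$", SCHEMA, flags=re.MULTILINE)
duplicates = [name for name, count in Counter(names).items() if count > 1]

print(len(names), "fields defined,", len(set(names)), "distinct names")
print("repeated names:", duplicates)
```

If the repetition is unintended, giving the last field its own name would presumably make it show up as a ninth entry in the output.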

@pkiraly I'll close this, because the JSON output format can be revised when we merge as discussed in #83.