pkiraly/metadata-qa-api

Make the Categories configurable

Opened this issue · 7 comments

@mielvds suggested the following: Would it be an idea to enable custom extensions of this list with arbitrary groups?

Right now the category is an enumeration and cannot be configured.

yep, that'll work!

@mielvds

I have two ideas, and I would like to ask your opinion.

Case 1: the categories on the field level are arbitrary, so you can add anything you want; there is no check at all.

Case 2: you also add a "categories" list on the schema level, which contains all possible values. It behaves as a controlled vocabulary, and every field-level category MUST BE in this list.

Here is an example for Case 2:

format: json
fields:
  - name: edm:ProvidedCHO/@about
    path:  $.['providedCHOs'][0]['about']
    categories:
      - MANDATORY
  - name: Proxy/dc:title
    path: $.['proxies'][?(@['europeanaProxy'] == false)]['dcTitle']
    categories:
      - DESCRIPTIVENESS
      - SEARCHABILITY
      - IDENTIFICATION
      - MULTILINGUALITY
      - CUSTOM
  - name: Proxy/dcterms:alternative
    path: $.['proxies'][?(@['europeanaProxy'] == false)]['dctermsAlternative']
    categories:
      - DESCRIPTIVENESS
      - SEARCHABILITY
      - IDENTIFICATION
      - MULTILINGUALITY
groups:
  - fields:
      - Proxy/dc:title
      - Proxy/dc:description
    categories:
      - MANDATORY
categories:
  - MANDATORY
  - DESCRIPTIVENESS
  - SEARCHABILITY
  - IDENTIFICATION
  - MULTILINGUALITY
  - CUSTOM

Note the categories in the last section. It gives the schema creator a bit more work, but keeps the categories consistent. If it is missing, the default list against which the tool compares the field categories will be the current enumeration.

Would you vote for case 1 or case 2?

I would say 1 because 2 won't add much in practice except for redundancy and less transparency. You can still implement the current situation if the categories on the field level are arbitrary. In fact, the schema doesn't even have to change.

Agreed, there is no check, but I think it's up to the writer of the schema to do it properly :) The worst that can happen is that the results are wrongly classified. Repeating the list on the schema level won't entirely prevent this mistake.

But I don't have the full picture of course; is there something I'm missing about the current list of categories?

Thanks!

There is one more thing I forgot to mention. The final output (the order of columns in the output CSV or in the Java collection) could be sorted according to this canonical list. Otherwise the order will be set on a first-come, first-served basis.
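To illustrate the ordering idea, here is a minimal sketch, not the actual implementation. The class name CategoryOrder and the method sortByCanonical are hypothetical and are not part of metadata-qa-api. The sketch also assumes that categories missing from the canonical list go to the end, in first-seen order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of metadata-qa-api.
class CategoryOrder {

  /**
   * Returns the used categories ordered by their position in the canonical
   * (schema level) list. Categories that are not in the canonical list are
   * placed last and keep their first-seen order, because List.sort is stable.
   */
  static List<String> sortByCanonical(List<String> used, List<String> canonical) {
    List<String> sorted = new ArrayList<>(used);
    sorted.sort(Comparator.comparingInt((String category) -> {
      int position = canonical.indexOf(category);
      return position >= 0 ? position : Integer.MAX_VALUE;
    }));
    return sorted;
  }

  public static void main(String[] args) {
    List<String> canonical = List.of(
        "MANDATORY", "DESCRIPTIVENESS", "SEARCHABILITY", "CUSTOM");
    // Order in which the categories were first encountered in the fields
    List<String> used = List.of("CUSTOM", "SEARCHABILITY", "MANDATORY");
    System.out.println(sortByCanonical(used, canonical));
    // prints [MANDATORY, SEARCHABILITY, CUSTOM]
  }
}
```

Without a canonical list the input order is simply kept, which corresponds to the first-come, first-served behaviour.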

I implemented it, but it requires some changes in the API. It is no longer possible to pass categories to the JsonBranch constructor; one has to use setCategories(List<String>) or setCategories(String...), or you can use the good old Category enum: setCategories(Category...).

Here is an example.

Old style:

new JsonBranch("Proxy/dc:title", "$.['dcTitle']", 
    Category.DESCRIPTIVENESS,
    Category.SEARCHABILITY,
    Category.IDENTIFICATION,
    Category.MULTILINGUALITY);

New style:

new JsonBranch("Proxy/dc:title", "$.['dcTitle']")
    .setCategories(
      Category.DESCRIPTIVENESS,
      Category.SEARCHABILITY,
      Category.IDENTIFICATION,
      Category.MULTILINGUALITY
    );

One more thing: if the schema configuration has the categories property, the individual fields' categories are checked against that list, and the API filters out categories which are not listed.
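The filtering could look roughly like the sketch below. It only illustrates the behaviour described above and is not the code in the library. CategoryFilter and its filter method are hypothetical names, and the sketch assumes that a missing or empty schema-level list means no check at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of metadata-qa-api.
class CategoryFilter {

  /**
   * Keeps only those field level categories which are listed on the schema
   * level. If the schema has no categories list (null or empty), the field
   * categories are returned unchanged (an assumption of this sketch).
   */
  static List<String> filter(List<String> fieldCategories, List<String> schemaCategories) {
    if (schemaCategories == null || schemaCategories.isEmpty()) {
      return new ArrayList<>(fieldCategories);
    }
    List<String> kept = new ArrayList<>();
    for (String category : fieldCategories) {
      if (schemaCategories.contains(category)) {
        kept.add(category);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> schemaCategories = List.of("MANDATORY", "DESCRIPTIVENESS", "CUSTOM");
    List<String> fieldCategories = List.of("DESCRIPTIVENESS", "TYPO", "CUSTOM");
    System.out.println(filter(fieldCategories, schemaCategories));
    // prints [DESCRIPTIVENESS, CUSTOM]
  }
}
```

Dropping unknown values silently is lenient; a stricter variant could throw an exception or log a warning, so that a misspelled category in the schema does not go unnoticed.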

Sounds good to me!

@pkiraly I think you can close this one