pkiraly/metadata-qa-api

Simple schema language

Closed this issue · 5 comments

While it's worthwhile to implement existing, language-specific schema languages like JSON Schema or SHACL, it might be helpful to have a custom, perhaps YAML-based schema language to handle the simple cases.

A schema reader or parser class would then create a Schema object at runtime. I think this would be particularly useful in a compiled language like Java.

For instance, you could have (from EdmFullBeanLimitedSchema.java):

format: json
fields:
  - edm:ProvidedCHO/@about:
      path: $.['providedCHOs'][0]['about']
      categories: MANDATORY
  - Proxy/dc:title:
      path: $.['proxies'][?(@['europeanaProxy'] == false)]['dcTitle']
      categories:
        - DESCRIPTIVENESS
        - SEARCHABILITY
        - IDENTIFICATION
        - MULTILINGUALITY
  - Proxy/dcterms:alternative:
      path: $.['proxies'][?(@['europeanaProxy'] == false)]['dctermsAlternative']
      categories:
        - DESCRIPTIVENESS
        - SEARCHABILITY
        - IDENTIFICATION
        - MULTILINGUALITY

...

groups:
  - fields:
      - Proxy/dc:title
      - Proxy/dc:description
    categories: MANDATORY

...






Hi @mielvds,

it is a good idea!

I started working on it. You cannot use it yet; these are just the initial steps. I have never worked with YAML before and I ran into problems, so I made some changes. I am using a YAML library called snakeyaml.

  1. I do not know how to map field names as keys, so I use an explicit "name" property instead.
  2. I also do not know how to handle categories that are sometimes a single value on the same line as the key and sometimes a list with each value on its own line, so I always put each value on its own line.
  3. I thought it was not important to put groups into a separate document, so I removed the document separator (...).

Here is the result:

format: json
fields:
  - name: edm:ProvidedCHO/@about
    path:  $.['providedCHOs'][0]['about']
    categories:
      - MANDATORY
  - name: Proxy/dc:title
    path: $.['proxies'][?(@['europeanaProxy'] == false)]['dcTitle']
    categories:
      - DESCRIPTIVENESS
      - SEARCHABILITY
      - IDENTIFICATION
      - MULTILINGUALITY
  - name: Proxy/dcterms:alternative
    path: $.['proxies'][?(@['europeanaProxy'] == false)]['dctermsAlternative']
    categories:
      - DESCRIPTIVENESS
      - SEARCHABILITY
      - IDENTIFICATION
      - MULTILINGUALITY
groups:
  - fields:
      - Proxy/dc:title
      - Proxy/dc:description
    categories:
      - MANDATORY

Are these changes acceptable to you? Do you have any suggestions for reading YAML with Java? I think it would be worth building the same for JSON(-LD) (which is clearer to me for the time being).

> I started working on it. You cannot use it yet; these are just the initial steps. I have never worked with YAML before and I ran into problems, so I made some changes. I am using a YAML library called snakeyaml.

I haven't developed with YAML before either, but I find it the simplest and most human-readable option. JSON would have worked as well, but since we're on this path now, let's continue.

> 1. I do not know how to map field names as keys, so I use an explicit "name" property instead.

That's fine. Since you created the Schema design, I think I'd better leave the decisions up to you 👍
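For reference, field names as keys remain readable if the document is loaded as generic maps instead of being bound to a bean. A minimal sketch, assuming each list item arrives as a single-entry map whose key is the field name and whose value holds the properties (which is what a generic YAML load typically produces); `FieldKeyFlattener` is a hypothetical name and not part of metadata-qa-api:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of metadata-qa-api.
final class FieldKeyFlattener {
  private FieldKeyFlattener() {}

  /**
   * Turns items of the form {fieldName: {path: ..., categories: ...}}
   * into items of the form {name: fieldName, path: ..., categories: ...}.
   */
  static List<Map<String, Object>> withExplicitNames(
      List<Map<String, Map<String, Object>>> items) {
    List<Map<String, Object>> fields = new ArrayList<>();
    for (Map<String, Map<String, Object>> item : items) {
      for (Map.Entry<String, Map<String, Object>> entry : item.entrySet()) {
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("name", entry.getKey());   // the key becomes the explicit name
        field.putAll(entry.getValue());      // remaining properties are copied as-is
        fields.add(field);
      }
    }
    return fields;
  }
}
```

Either layout then ends up in the same shape, so the explicit "name" property is a matter of taste rather than a technical necessity.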

> 2. I also do not know how to handle categories that are sometimes a single value on the same line as the key and sometimes a list with each value on its own line, so I always put each value on its own line.

Yeah, not sure; you probably have to use brackets. We could use categories: [DESCRIPTIVENESS, SEARCHABILITY] as well; this is equivalent to the - style list.
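The singular/plural question could also be settled on the reader side. A minimal sketch, assuming the YAML library hands back a plain `String` for a scalar value and a `java.util.Collection` for either list style; `CategoryValues` is a hypothetical name and not part of metadata-qa-api:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of metadata-qa-api.
final class CategoryValues {
  private CategoryValues() {}

  /**
   * Accepts the raw value of a "categories" key and always returns a list.
   * Assumption: a scalar arrives as a String, a block or flow sequence as a Collection.
   */
  static List<String> normalize(Object raw) {
    List<String> result = new ArrayList<>();
    if (raw == null) {
      return result;                            // key missing or empty: no categories
    }
    if (raw instanceof Collection) {
      for (Object item : (Collection<?>) raw) {
        result.add(String.valueOf(item).trim());
      }
    } else {
      result.add(String.valueOf(raw).trim());   // e.g. categories: MANDATORY
    }
    return result;
  }
}
```

With this, `categories: MANDATORY`, the bracket form and the - style list would all yield a list of strings.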

> 3. I thought it was not important to put groups into a separate document, so I removed the document separator (`...`).

Ah, this was just for the example to indicate "and so on", not a document separator. Not very clear, my bad :)

> Are these changes acceptable to you? Do you have any suggestions for reading YAML with Java? I think it would be worth building the same for JSON(-LD) (which is clearer to me for the time being).

Very nice, yes. I would be all for JSON(-LD) in fact; YAML was just a suggestion. I suggest we switch: it will be easier, and we both have more experience in that area.

If these configs work out, I suggest eventually replacing all current schema implementations with these. Of course, all features, such as Solr fields, should be covered.

I have implemented the JSON version:

{
  "format": "json",
  "fields": [
    {
      "name": "edm:ProvidedCHO/@about",
      "path":  "$.['providedCHOs'][0]['about']",
      "categories": ["MANDATORY"]
    },
    {
      "name": "Proxy/dc:title",
      "path": "$.['proxies'][?(@['europeanaProxy'] == false)]['dcTitle']",
      "categories": [
        "DESCRIPTIVENESS",
        "SEARCHABILITY",
        "IDENTIFICATION",
        "MULTILINGUALITY"
      ]
    },
    {
      "name": "Proxy/dcterms:alternative",
      "path": "$.['proxies'][?(@['europeanaProxy'] == false)]['dctermsAlternative']",
      "categories": [
        "DESCRIPTIVENESS",
        "SEARCHABILITY",
        "IDENTIFICATION",
        "MULTILINGUALITY"
      ]
    }
  ],
  "groups": [
    {
      "fields": [
        "Proxy/dc:title",
        "Proxy/dc:description"
      ],
      "categories": [
        "MANDATORY"
      ]
    }
  ]
}

I'll add validation, and then additional features.
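What the validation will cover is not spelled out here. One plausible check is that every field named in a group is also declared under "fields"; in the abbreviated example above, Proxy/dc:description appears in a group but not among the three listed fields. A minimal sketch of that single check, with the hypothetical name `GroupReferenceCheck` (not part of metadata-qa-api):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical check, not taken from metadata-qa-api.
final class GroupReferenceCheck {
  private GroupReferenceCheck() {}

  /** Returns the group members that match no declared field name, in group order. */
  static Set<String> undeclared(List<String> declaredFieldNames, List<String> groupFields) {
    Set<String> missing = new LinkedHashSet<>(groupFields);
    missing.removeAll(declaredFieldNames);
    return missing;
  }
}
```

An empty result means the group only refers to known fields; anything else could be reported to the user before the Schema is built.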

@mielvds

One step further: converting configuration to a Schema object.

Creating Schema from a YAML configuration file (readYaml):

Schema schema = ConfigurationReader.readYaml("path/to/schema.yaml").asSchema();
                                        ^^^^                 ^^^^

Creating Schema from a JSON configuration file (readJson):

Schema schema = ConfigurationReader.readJson("path/to/schema.json").asSchema();
                                        ^^^^                 ^^^^

Note: it does not implement every property of the Schema yet.

@pkiraly Is the category a controlled list? I got this error when trying to group using the arbitrary "ID" string.


Exception in thread "main" java.lang.IllegalArgumentException: No enum constant de.gwdg.metadataqa.api.model.Category.ID
	at java.lang.Enum.valueOf(Enum.java:238)
	at de.gwdg.metadataqa.api.model.Category.valueOf(Category.java:13)
	at de.gwdg.metadataqa.api.util.SchemaFactory.fromConfig(SchemaFactory.java:30)
	at de.gwdg.metadataqa.api.configuration.Configuration.asSchema(Configuration.java:38)
	at be.meemoo.App.main(App.java:43)

Would it be an idea to allow custom extensions of this list with arbitrary groups?
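The stack trace shows `Category.valueOf` rejecting a name that is not an enum constant. A minimal sketch of a lookup that tolerates custom names, assuming a stand-in `Category` enum that holds only the five values used in this thread (the real enum may contain more) and a hypothetical `CategoryLookup` class:

```java
import java.util.Optional;

// Stand-in for de.gwdg.metadataqa.api.model.Category; only the values seen in this thread.
enum Category {
  MANDATORY, DESCRIPTIVENESS, SEARCHABILITY, IDENTIFICATION, MULTILINGUALITY
}

// Hypothetical helper, not part of metadata-qa-api.
final class CategoryLookup {
  private CategoryLookup() {}

  /** Returns the built-in category, or empty when the name is a custom one such as "ID". */
  static Optional<Category> builtIn(String name) {
    try {
      return Optional.of(Category.valueOf(name));
    } catch (IllegalArgumentException e) {
      // Enum.valueOf throws this for unknown names, as in the stack trace above.
      return Optional.empty();
    }
  }
}
```

A caller could then keep names that come back empty as free-form group labels instead of failing, which is one way the list could be opened up.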