OBOFoundry/OBO-Metadata-Editor

Use ruamel.yaml to improve YAML line determination for JSONSchema errors

Closed this issue · 2 comments

One of the most fragile parts of our current code is how we determine the line number of a problem found in the YAML. The JSON Schema validator takes a Python data structure as the input, and ValidationError objects contain path information pointing to the problematic part of the data structure, but when we use PyYAML to read the YAML the line numbers are lost. So we've got a system using regular expressions to try and follow the path to the relevant line of the YAML.

ruamel.yaml is a YAML parsing library designed to support round-trip conversion of YAML files through Python, including formatting and comments. This means that when it loads YAML, the data structure it returns still contains line and column numbers. So given the path to the problem, we can find that part of the data structure and then look up the line and column numbers. Here's an example:

import json
import jsonschema
import sys

from ruamel.yaml import YAML

yaml = YAML()  # default is round-trip mode, which keeps line and column data
with open("obi.md") as f:
    data = yaml.load(f)

with open("registry_schema.json") as f:
    registry_schema = json.load(f)

try:
    jsonschema.validate(data, registry_schema)
    print("SUCCESS")
except jsonschema.exceptions.ValidationError as err:
    print(err.message)
    keys = list(err.path)
    if err.validator == "additionalProperties":
        m = err.message
        key = m[m.index("'")+1:m.rindex("'")] # get the added key name; this is hacky
        keys.append(key)
    if keys:  # a path with a single key (a top-level field) still gets a position
        subset = data
        while len(keys) > 1:
            subset = subset[keys.pop(0)] # follow the path, stopping with one key left
        pos = subset.lc.data[keys[0]] # get ruamel.yaml's line-column information
        print(f"at line {pos[0] + 1}, column {pos[1] + 1}")
    sys.exit(1)
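
The lookup at the end of the example could be factored into a small helper. This is only a rough sketch: the name `locate` is made up for illustration, and it assumes the data came from ruamel.yaml's round-trip loader, so that each mapping and sequence carries the `lc` attribute used above. It returns None when no position is available, for instance for plain dicts loaded with PyYAML.

```python
from typing import Any, Optional, Sequence, Tuple


def locate(data: Any, path: Sequence[Any]) -> Optional[Tuple[int, int]]:
    """Return the 1-based (line, column) of the node at `path`, or None.

    Hypothetical helper: `data` is assumed to come from ruamel.yaml's
    round-trip loader, whose containers expose position data as `lc.data`.
    """
    if not path:
        return None  # the error concerns the document root; nothing to look up
    parent = data
    for key in path[:-1]:
        parent = parent[key]  # walk down to the container holding the last key
    positions = getattr(getattr(parent, "lc", None), "data", None)
    if positions is None or path[-1] not in positions:
        return None  # no position recorded (plain dict/list, or unknown key)
    line, col = positions[path[-1]][:2]  # zero-based position of the key or item
    return line + 1, col + 1


if __name__ == "__main__":
    # Imported here so the helper itself has no hard dependency on ruamel.yaml.
    from ruamel.yaml import YAML

    doc = YAML().load("id: obi\nproducts:\n  - id: obi.owl\n")
    print(locate(doc, ["products", 0, "id"]))
```

With something like this, the `except` branch would only need to call `locate(data, keys)` and print the result when it is not None.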

I think this approach will be much more robust, and I hope that it won't be too hard to replace our current code.
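
The example above flags its way of recovering the added key name from the error message as hacky. One possible alternative, sketched here under assumptions, is to compare the offending object with the schema that rejected it. A jsonschema `ValidationError` exposes both as `err.instance` and `err.schema`. The helper name `unexpected_keys` is made up, and it only considers `properties` and `patternProperties`.

```python
import re
from typing import Any, List, Mapping


def unexpected_keys(instance: Mapping[str, Any], schema: Mapping[str, Any]) -> List[str]:
    """List keys of `instance` that the schema neither declares nor pattern-matches.

    Hypothetical helper, intended to be called with err.instance and err.schema
    when err.validator == "additionalProperties".
    """
    declared = schema.get("properties", {})
    patterns = list(schema.get("patternProperties", {}))
    return [
        key
        for key in instance
        if key not in declared
        and not any(re.search(pattern, key) for pattern in patterns)
    ]


# Example with a made-up schema fragment and a misspelled key.
schema = {"properties": {"id": {}, "title": {}}, "patternProperties": {"^x-": {}}}
print(unexpected_keys({"id": "obi", "titel": "OBI", "x-note": "ok"}, schema))
```

Unlike slicing the message, this would also cope with several unexpected keys at once, since each one could be appended to the path and located separately.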

I didn't find the official docs all that helpful for what I wanted to do:

https://yaml.readthedocs.io/

Instead I relied on the source code:

https://sourceforge.net/p/ruamel-yaml/code/ci/default/tree/comments.py

Much better. It also handles the duplicate key validation. :-D