biolink/kgx

remove knowledge source parameters from node files (instead rely on provided_by for nodes)

sierra-moxon opened this issue · 4 comments

When a KGX transform is run with the knowledge_sources parameter and given values for both aggregator_knowledge_source and primary_knowledge_source, only primary_knowledge_source is added to the edge file, and all of its values also end up in a neighboring provided_by column.

This affects the obojson->tsv transform in particular.

So far I am unable to reproduce this with the following input and output args:

    input_args = {
        "filename": [
            os.path.join(RESOURCE_DIR, "pato.json")
        ],
        "format": "obojson",
        "provided_by": True,
        "aggregator_knowledge_source": True,
        "primary_knowledge_source": True
    }

    output_args = {
        "filename": os.path.join(TARGET_DIR, "pato-export.tsv"),
        "format": "tsv",
    }

I cannot replicate the edge file issues noted in this ticket, but I can see provided_by populated in the node file, as expected. Since knowledge_source properties are currently association slots, we don't expect them to be found on nodes directly.
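One way to check an exported file for stray knowledge source columns is to read its TSV header. The sketch below is only an illustration: `read_columns`, `knowledge_source_columns`, and the sample header are hypothetical and not part of KGX, and it assumes knowledge source column names end in `knowledge_source`.

```python
import csv
import io
from typing import Iterable, List, TextIO

# Assumption: knowledge source columns all end with this suffix.
KNOWLEDGE_SOURCE_SUFFIX = "knowledge_source"


def read_columns(handle: TextIO) -> List[str]:
    """Return the header row of a tab-separated file (empty list if the file is empty)."""
    reader = csv.reader(handle, delimiter="\t")
    return next(reader, [])


def knowledge_source_columns(columns: Iterable[str]) -> List[str]:
    """Return the column names that look like knowledge source slots."""
    return [c for c in columns if c.endswith(KNOWLEDGE_SOURCE_SUFFIX)]


# Illustrative header only; a real check would open the exported node file instead.
sample = io.StringIO("id\tcategory\tname\tprovided_by\tprimary_knowledge_source\n")
columns = read_columns(sample)
leaked = knowledge_source_columns(columns)
```

For a node file, a non-empty `leaked` list would point to knowledge source columns that belong on edges only.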

One thing we can do to make the node file easier to understand with regard to provenance is to remove the knowledge_source properties that get added there.
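A minimal sketch of that cleanup, assuming node records are plain dicts. The function name `strip_knowledge_sources`, the `KNOWLEDGE_SOURCE_KEYS` set, and the sample record are hypothetical illustrations, not KGX API and not necessarily how the fix was implemented.

```python
from typing import Any, Dict

# Assumption: these are the edge-only provenance keys to drop from nodes.
KNOWLEDGE_SOURCE_KEYS = {
    "knowledge_source",
    "primary_knowledge_source",
    "aggregator_knowledge_source",
}


def strip_knowledge_sources(node: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a node record without knowledge source keys.

    provided_by is left untouched, since it is the provenance slot for nodes.
    """
    return {k: v for k, v in node.items() if k not in KNOWLEDGE_SOURCE_KEYS}


# Illustrative node record; the values are made up for the example.
node = {
    "id": "PATO:0000001",
    "category": ["biolink:NamedThing"],
    "provided_by": ["pato.json"],
    "primary_knowledge_source": ["pato.json"],
}
cleaned = strip_knowledge_sources(node)
```

Returning a copy rather than mutating the input keeps the original record available for the edge side of the transform, where the knowledge source slots are still expected.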

Fixed with #405.