biolink/kgx

Duplication of identifiers in pipe-delimited slot value lists

RichardBruskiewich opened this issue · 3 comments

It has (possibly) been noted already that some fields in KGX, e.g. the provided_by slot, sometimes accumulate duplicate (CURIE) identifiers. Shouldn't such lists instead be managed internally as proper sets (without duplicate members)?
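To illustrate the set-like behavior I have in mind, here is a minimal sketch. The helper name `dedupe_pipe_delimited` and the sample CURIEs are hypothetical and not part of KGX; the sketch only assumes the slot value arrives as a single pipe-delimited string.

```python
def dedupe_pipe_delimited(value: str, delimiter: str = "|") -> str:
    """Drop repeated members from a delimited string, keeping first-seen order.

    Hypothetical helper for illustration only; not an existing KGX function.
    """
    members = [m.strip() for m in value.split(delimiter)]
    # dict.fromkeys preserves insertion order (Python 3.7+) and discards repeats;
    # empty members (e.g. from a trailing delimiter) are skipped.
    unique = dict.fromkeys(m for m in members if m)
    return delimiter.join(unique)


# Made-up identifiers, used only to show the behavior
print(dedupe_pipe_delimited("infores:a|infores:b|infores:a"))  # infores:a|infores:b
```

An order-preserving approach is used here rather than a plain `set`, so the serialized output stays stable between runs.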

In particular, we need to check the kgx merge operation for this anomaly, and perhaps other contexts as well.

I think this is fixed in #408; making a note to check.

Do we have a unit test to check this?
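Something along these lines could serve as a starting point for such a test. This is only a sketch: `merge_slot_values` is a hypothetical stand-in for whatever the merge code actually does with list-valued slots, and the identifiers are made up.

```python
from typing import List


def merge_slot_values(existing: List[str], incoming: List[str]) -> List[str]:
    """Hypothetical stand-in for the merge step (not the real KGX code).

    Returns the union of both lists, keeping first-seen order.
    """
    return list(dict.fromkeys([*existing, *incoming]))


def test_merge_does_not_duplicate_identifiers():
    merged = merge_slot_values(
        ["infores:a", "infores:b"],
        ["infores:b", "infores:c"],
    )
    assert merged == ["infores:a", "infores:b", "infores:c"]

    # Serializing to the pipe-delimited form should not reintroduce repeats
    serialized = "|".join(merged)
    assert serialized.split("|").count("infores:b") == 1


test_merge_does_not_duplicate_identifiers()
```

In a real test, the stand-in would be replaced by a call into the actual merge operation on two small graphs that share a node with overlapping provided_by values.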

@sierra, is the relevant code in https://github.com/biolink/kgx/blob/master/kgx/utils/kgx_utils.py#L831? I'm not sure if this snippet of code avoids duplication in pipe-delimited lists...

I applied a fix to the above snippet of code in the list-related PR #415.