clulab/reach

Remove duplicate entries from Bioresources

Closed this issue · 7 comments

I am about to remove the duplicate entry for E3:

[['E3', '5756', '', 'pubchem', 'Simple_chemical'],
 ['E3', 'E3_Ub_ligase', '', 'fplx', 'Family']]

@MihaiSurdeanu Which one is the correct entry?
I also found more entities in NER-Grounding-Override.tsv whose entries are duplicated verbatim. I will remove the redundant copies.
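For reference, removing the verbatim repeats could look roughly like the sketch below. It is not the script actually used for this cleanup. The helper name `drop_identical_rows` is hypothetical, and the sketch assumes that override rows are tab-separated and that lines starting with `#` are comments to be left alone. The sample rows reuse the E3 entries quoted above.

```python
def drop_identical_rows(lines):
    """Return the lines with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in lines:
        key = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines are passed through untouched.
        if not key.strip() or key.startswith("#"):
            kept.append(line)
            continue
        if key in seen:
            continue  # exact repeat of an earlier row
        seen.add(key)
        kept.append(line)
    return kept


# Small in-memory sample; in practice the lines would come from the override file.
lines = [
    "# comment\n",
    "E3\t5756\t\tpubchem\tSimple_chemical\n",
    "E3\tE3_Ub_ligase\t\tfplx\tFamily\n",
    "E3\t5756\t\tpubchem\tSimple_chemical\n",  # verbatim repeat
]
deduped = drop_identical_rows(lines)
```

This only drops rows that match character for character, so the two different E3 groundings both survive and still need a manual decision.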

While I am at it, should I look for duplicate entries in the other KB files?

Not sure. @bgyori ?

You should check the override KB for duplicates. Probably not the other ones.
Thanks!

See also the list at the bottom of #742.

Thanks @kwalcock. I am not sure how to deal with the other duplicates. Consider axin: it appears in uniprot but also has a manual override. Does it make sense to remove it from uniprot? I don't think it does.

Duplicates in the overrides: interesting. I don't think having duplicates in there makes sense, so we should probably remove those. If possible, I'd like to look at the choices first to check that they are reasonable. As for the other files, duplicates at the level of the entity string are normal, expected ambiguities, so we shouldn't remove them.
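To make the choices easier to review, the override entries that share an entity string but disagree on the grounding could be listed with something like the following sketch. The function name `conflicting_groundings` is hypothetical, the first column is assumed to hold the entity string, and matching is assumed to be exact (case-sensitive). The last sample row is made up for illustration.

```python
from collections import defaultdict


def conflicting_groundings(rows):
    """Map each entity string to its distinct groundings when it has more than one."""
    by_text = defaultdict(list)
    for row in rows:
        # Assumption: column 0 is the entity string, the rest is the grounding.
        text, grounding = row[0], tuple(row[1:])
        if grounding not in by_text[text]:
            by_text[text].append(grounding)
    return {text: g for text, g in by_text.items() if len(g) > 1}


rows = [
    ['E3', '5756', '', 'pubchem', 'Simple_chemical'],
    ['E3', 'E3_Ub_ligase', '', 'fplx', 'Family'],
    ['E3', '5756', '', 'pubchem', 'Simple_chemical'],  # identical repeat, not a conflict
    ['example', 'ID_1', '', 'uniprot', 'Gene_or_gene_product'],  # made-up row
]
conflicts = conflicting_groundings(rows)
for text, groundings in conflicts.items():
    print(text, groundings)
```

Identical repeats are collapsed before counting, so only entries that need a human decision, such as E3, are reported.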

I think that they have duplicates within the overrides because I stopped adding to the list before it got to processing the regular KBs. As usual, I may be mistaken.

Actually, I could work on eliminating the override duplicates and push here. Shall I do that?