Knowledge-Graph-Hub/kg-covid-19

Improve normalization of proteins

Describe the bug

At least some proteins are not normalized to a single node. For example, ACE2 currently appears under three different identifiers:

UniProtKB:Q9BYF1        ACE2    pharmgkb|intact|go-cams
NCBIGene:59272  ACE2    zhou_host_proteins|SciBite-CORD-19
ENSEMBL:ENSG00000130234 ACE2    STRING  # this is the gene, so a separate node arguably is okay (ish)

To Reproduce

$ wget https://kg-hub.berkeleybop.io/kg-covid-19/20210101/kg-covid-19.tar.gz
$ tar xvzf kg-covid-19.tar.gz
$ cut -f1,2,4 merged-kg_nodes.tsv | grep -w -E 'ACE2' | grep -v "^CORD" # ignore CORD-19 papers that mention ACE2 in description
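To check whether the problem goes beyond ACE2, the same idea can be generalized: group node IDs by name and report every name that has more than one ID. This is only a rough sketch. It assumes that column 1 is the node ID and column 2 is the name (as in the `cut` command above), and `names_with_multiple_ids` is a made-up helper, not part of KGX. The `CORD:example` row is invented to show the filter.

```python
import csv
from collections import defaultdict


def names_with_multiple_ids(lines, skip_prefix="CORD"):
    """Return {name: sorted IDs} for every name attached to more than one node ID.

    Assumes tab-separated rows with the node ID in column 1 and the name in
    column 2. Rows whose ID starts with skip_prefix are ignored, mirroring
    the `grep -v "^CORD"` step.
    """
    ids_by_name = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2 or row[0].startswith(skip_prefix):
            continue
        node_id, name = row[0], row[1]
        if name:
            ids_by_name[name].add(node_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Small in-memory sample built from the rows shown in this issue.
sample = [
    "UniProtKB:Q9BYF1\tACE2",
    "NCBIGene:59272\tACE2",
    "ENSEMBL:ENSG00000130234\tACE2",
    "CORD:example\tACE2",  # invented row, dropped by the prefix filter
]
duplicates = names_with_multiple_ids(sample)
print(duplicates)
```

To run it on the real file, pass an open file handle instead of `sample`, e.g. `names_with_multiple_ids(open("merged-kg_nodes.tsv", newline=""))`. Matching on name alone will also flag legitimately distinct nodes (such as a gene and its protein), so the output is a list of candidates to review, not a list of confirmed bugs.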

Expected behavior

The entries should be merged into a single node, something like:

UniProtKB:Q9BYF1 ACE2 pharmgkb|intact|go-cams|zhou_host_proteins|SciBite-CORD-19|STRING

Version

version 20210101

Per the presentation by @cmungall at the Monarch huddle today, we can improve normalization by doing clique merging with KGX, using an SSSOM mapping file.
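For anyone unfamiliar with clique merging, here is a minimal sketch of the idea in plain Python. It is not the KGX implementation or its API. Nodes linked by equivalence mappings are grouped with union-find, one ID per group is picked as the leader, and the `provided_by` sources are combined. The `PREFIX_PRIORITY` order, the `merge_cliques` helper, and the two mapping pairs are assumptions made for illustration. In practice the pairs would come from the subject and object columns of the SSSOM file.

```python
# Assumed leader preference: protein ID first, then gene IDs.
PREFIX_PRIORITY = ["UniProtKB", "NCBIGene", "ENSEMBL"]


def find(parent, node_id):
    """Union-find lookup with path halving."""
    while parent[node_id] != node_id:
        parent[node_id] = parent[parent[node_id]]
        node_id = parent[node_id]
    return node_id


def merge_cliques(nodes, mappings, priority=PREFIX_PRIORITY):
    """Merge equivalent nodes; return {leader_id: combined provided_by list}.

    nodes:    {node_id: [provided_by, ...]}
    mappings: iterable of (subject_id, object_id) equivalence pairs
    """
    parent = {node_id: node_id for node_id in nodes}
    for subject_id, object_id in mappings:
        # Ignore mappings that point at nodes missing from the graph.
        if subject_id in parent and object_id in parent:
            parent[find(parent, subject_id)] = find(parent, object_id)

    cliques = {}
    for node_id in nodes:
        cliques.setdefault(find(parent, node_id), []).append(node_id)

    def rank(node_id):
        prefix = node_id.split(":", 1)[0]
        return priority.index(prefix) if prefix in priority else len(priority)

    merged = {}
    for members in cliques.values():
        ordered = sorted(members, key=rank)
        sources = []
        for member in ordered:
            for source in nodes[member]:
                if source not in sources:
                    sources.append(source)
        merged[ordered[0]] = sources  # highest-priority ID leads the clique
    return merged


nodes = {
    "UniProtKB:Q9BYF1": ["pharmgkb", "intact", "go-cams"],
    "NCBIGene:59272": ["zhou_host_proteins", "SciBite-CORD-19"],
    "ENSEMBL:ENSG00000130234": ["STRING"],
}
# Illustrative equivalence pairs, standing in for rows of an SSSOM file.
mappings = [
    ("UniProtKB:Q9BYF1", "NCBIGene:59272"),
    ("UniProtKB:Q9BYF1", "ENSEMBL:ENSG00000130234"),
]
merged = merge_cliques(nodes, mappings)
for leader, sources in merged.items():
    print(leader, "|".join(sources))
```

Leaving the ENSEMBL pair out of `mappings` keeps the gene as its own node, which covers the "separate node is arguably okay" case noted in the description. Whether gene and protein IDs belong in one clique is a decision for the mapping file, not the merge step.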