Improve normalization of proteins
justaddcoffee commented
Describe the bug
At least some proteins are not normalized to a single node. For example, ACE2 currently appears as three separate nodes:
UniProtKB:Q9BYF1 ACE2 pharmgkb|intact|go-cams
NCBIGene:59272 ACE2 zhou_host_proteins|SciBite-CORD-19
ENSEMBL:ENSG00000130234 ACE2 STRING # this is the gene, so a separate node arguably is okay (ish)
To Reproduce
$ wget https://kg-hub.berkeleybop.io/kg-covid-19/20210101/kg-covid-19.tar.gz
$ tar xvzf kg-covid-19.tar.gz
$ cut -f1,2,4 merged-kg_nodes.tsv | grep -w -E 'ACE2' | grep -v "^CORD" # ignore CORD-19 papers that mention ACE2 in description
Expected behavior
The command above should return a single merged node, something like:
UniProtKB:Q9BYF1 ACE2 pharmgkb|intact|go-cams|zhou_host_proteins|SciBite-CORD-19|STRING
Version
version 20210101
justaddcoffee commented
Per the presentation by @cmungall at the Monarch huddle today, we can improve normalization by doing clique merging with KGX and an SSSOM mapping file.
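
To make the idea concrete, here is a rough sketch of what clique merging would do to the three ACE2 nodes. This is not KGX's implementation, only an illustration of the technique using a small union-find. The function names, the `PREFIX_PRIORITY` order, and the two mapping pairs are assumptions made for this example; in practice the pairs would come from the `subject_id` and `object_id` columns of the SSSOM file.

```python
from collections import defaultdict

# Hypothetical priority order for electing the clique leader:
# prefer the protein identifier over the gene identifiers.
PREFIX_PRIORITY = ["UniProtKB", "NCBIGene", "ENSEMBL"]


def rank(node_id):
    """Sort key: position of the CURIE prefix in PREFIX_PRIORITY, then the ID."""
    prefix = node_id.split(":", 1)[0]
    if prefix in PREFIX_PRIORITY:
        return (PREFIX_PRIORITY.index(prefix), node_id)
    return (len(PREFIX_PRIORITY), node_id)


def find(parent, x):
    """Return the representative of x's set (union-find with path compression)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_cliques(nodes, mappings):
    """Collapse nodes linked by equivalence mappings into one node per clique.

    nodes:    dict of node ID -> (name, list of sources)
    mappings: iterable of (subject_id, object_id) pairs treated as equivalent
    """
    parent = {}
    for subject_id, object_id in mappings:
        parent[find(parent, subject_id)] = find(parent, object_id)

    # Group node IDs by the representative of their clique.
    cliques = defaultdict(list)
    for node_id in nodes:
        cliques[find(parent, node_id)].append(node_id)

    merged = {}
    for members in cliques.values():
        members = sorted(members, key=rank)
        leader = members[0]
        sources = []
        for member in members:
            for source in nodes[member][1]:
                if source not in sources:  # keep order, drop duplicates
                    sources.append(source)
        merged[leader] = (nodes[leader][0], sources)
    return merged


# The three ACE2 nodes from this issue.
nodes = {
    "UniProtKB:Q9BYF1": ("ACE2", ["pharmgkb", "intact", "go-cams"]),
    "NCBIGene:59272": ("ACE2", ["zhou_host_proteins", "SciBite-CORD-19"]),
    "ENSEMBL:ENSG00000130234": ("ACE2", ["STRING"]),
}

# Illustrative mapping pairs (assumed, not read from a real SSSOM file).
mappings = [
    ("UniProtKB:Q9BYF1", "NCBIGene:59272"),
    ("NCBIGene:59272", "ENSEMBL:ENSG00000130234"),
]

merged = merge_cliques(nodes, mappings)
for node_id, (name, sources) in merged.items():
    print(node_id, name, "|".join(sources))
```

With these assumed mappings the sketch leaves a single `UniProtKB:Q9BYF1` node carrying all six sources, which matches the expected behavior above. Dropping the second mapping pair would keep the ENSEMBL gene as its own node, if we decide that is preferable.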