ACE2 interaction present in nt but missing in tsv (20201001 release)
realmarcin opened this issue · 3 comments
Describe the bug
A biolink:interacts_with triple between ACE2 (Q9BYF1) and GLP1R (P43220) is present in the .nt file but not in the merged TSV for the 20201001 release.
To Reproduce
This triple:
P43220 interacts_with Q9BYF1
is present in the .nt file from 20201001:
https://kg-hub.berkeleybop.io/kg-covid-19/20201001/kg-covid-19.nt.gz
but not in the merged TSV for the release:
(venv) [marcin@n0001 20201001]$ grep ENSP00000389326 merged-kg_edges.tsv | grep ENSP00000362353
(venv) [marcin@n0001 20201001]$
This interaction is present in the transformed STRING TSVs:
(venv) [marcin@n0001 STRING_diff]$ grep ENSP00000389326 20201001_edges.tsv | grep ENSP00000362353
ENSEMBL:ENSP00000362353 biolink:interacts_with ENSEMBL:ENSP00000389326 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
ENSEMBL:ENSP00000389326 biolink:interacts_with ENSEMBL:ENSP00000362353 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
(venv) [marcin@n0001 STRING_diff]$ grep ENSP00000389326 20201101_edges.tsv | grep ENSP00000362353
ENSEMBL:ENSP00000362353 biolink:interacts_with ENSEMBL:ENSP00000389326 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
ENSEMBL:ENSP00000389326 biolink:interacts_with ENSEMBL:ENSP00000362353 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
(In fact, 20201001_edges.tsv is identical to 20201101_edges.tsv.)
Note that this interaction is absent in both the nt and tsv from 20201101.
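For anyone re-running the check, here is a minimal Python sketch of the same lookup the grep commands do, matching on full CURIEs in either direction. It assumes the edges TSV has a header row with `subject` and `object` columns (the `edge_label` header name in the sample is also an assumption); `find_edges_between` is a hypothetical helper, not part of KGX or kg-covid-19.

```python
import csv
import io


def find_edges_between(tsv_file, id_a, id_b):
    """Return rows whose subject/object pair is {id_a, id_b}, in either direction.

    Assumes a tab-separated file whose header contains 'subject' and 'object'.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    wanted = {id_a, id_b}
    return [row for row in reader if {row["subject"], row["object"]} == wanted]


# Two rows taken from the STRING edges shown above (header names assumed).
sample = (
    "subject\tedge_label\tobject\n"
    "ENSEMBL:ENSP00000362353\tbiolink:interacts_with\tENSEMBL:ENSP00000389326\n"
    "ENSEMBL:ENSP00000389326\tbiolink:interacts_with\tENSEMBL:ENSP00000362353\n"
)

ensembl_hits = find_edges_between(
    io.StringIO(sample), "ENSEMBL:ENSP00000389326", "ENSEMBL:ENSP00000362353"
)
uniprot_hits = find_edges_between(
    io.StringIO(sample), "UniProtKB:P43220", "UniProtKB:Q9BYF1"
)
print(len(ensembl_hits), len(uniprot_hits))
```

The second query is there to show that a lookup only succeeds when it uses the same identifier scheme as the file. Whether the merged TSV keeps ENSEMBL IDs or rewrites them is not established in this issue, so it may be worth searching it for both forms.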
The metadata from the 20201001 nt file suggests that STRING is the source and that this interaction is from text mining:
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f http://www.w3.org/1999/02/22-rdf-syntax-ns#subject http://identifiers.org/uniprot/Q9BYF1 .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate https://w3id.org/biolink/vocab/interacts_with .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f http://www.w3.org/1999/02/22-rdf-syntax-ns#object http://identifiers.org/uniprot/P43220 .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://w3id.org/biolink/vocab/relation http://purl.obolibrary.org/obo/RO_0002434 .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://w3id.org/biolink/vocab/provided_by "STRING" .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f http://www.w3.org/1999/02/22-rdf-syntax-ns#type "biolink:Association"^^http://www.w3.org/2001/XMLSchema#string .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/combined_score "157.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/neighborhood "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/neighborhood_transferred "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/fusion "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/cooccurence "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/homology "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/coexpression "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/coexpression_transferred "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/experiments "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/experiments_transferred "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/database "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/database_transferred "0.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/textmining "108.0"^^http://www.w3.org/2001/XMLSchema#float .
urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f https://www.example.org/UNKNOWN/textmining_transferred "94.0"^^http://www.w3.org/2001/XMLSchema#float .
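To pull edges like this one out of the .nt file in bulk, the statements can be grouped by their `urn:uuid` node and the rdf:subject / rdf:predicate / rdf:object values read back. The following is a rough sketch under the assumption that each statement sits on one line in the shape shown above (with or without angle brackets around IRIs); it is a simplified line splitter, not a full N-Triples parser and not how KGX itself reads the file. The function names are made up for illustration.

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def parse_statement_line(line):
    """Split one simplified N-Triples line into (subject, predicate, object)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()  # drop the statement terminator
    terms = body.split(None, 2)  # the object may contain spaces, so split twice only
    # Remove surrounding angle brackets from IRIs if they are present.
    return tuple(
        t[1:-1] if t.startswith("<") and t.endswith(">") else t for t in terms
    )


def collect_reified_edges(lines):
    """Group statements by node and return (subject, predicate, object) edges."""
    statements = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue
        node, pred, obj = parse_statement_line(line)
        statements[node][pred] = obj
    edges = []
    for props in statements.values():
        if RDF + "subject" in props and RDF + "object" in props:
            edges.append(
                (props[RDF + "subject"], props.get(RDF + "predicate"), props[RDF + "object"])
            )
    return edges


node = "urn:uuid:1aad7d40-aa7a-4ec1-87c9-85108b3eb77f"
lines = [
    f"{node} {RDF}subject http://identifiers.org/uniprot/Q9BYF1 .",
    f"{node} {RDF}predicate https://w3id.org/biolink/vocab/interacts_with .",
    f"{node} {RDF}object http://identifiers.org/uniprot/P43220 .",
    f'{node} https://w3id.org/biolink/vocab/provided_by "STRING" .',
]
print(collect_reified_edges(lines))
```

Other properties on the same node, such as provided_by and the STRING score components, end up in the per-node dictionary and could be returned as well if needed.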
Expected behavior
The nt and tsv files should mirror each other semantically.
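One way to make "mirror each other" checkable is to reduce both files to sets of edges and diff them. The sketch below assumes both edge lists have already been normalized to the same identifier scheme (the `UniProtKB:` CURIEs here are only an illustration) and that direction does not matter, which seems reasonable for interacts_with but is an assumption for other predicates. `undirected_key` and `compare_edge_sets` are hypothetical names.

```python
def undirected_key(subject, predicate, obj):
    """Key an edge so that A->B and B->A with the same predicate compare equal."""
    return (frozenset((subject, obj)), predicate)


def compare_edge_sets(nt_edges, tsv_edges):
    """Return (edges only in nt, edges only in tsv) as sets of undirected keys."""
    nt_keys = {undirected_key(*edge) for edge in nt_edges}
    tsv_keys = {undirected_key(*edge) for edge in tsv_edges}
    return nt_keys - tsv_keys, tsv_keys - nt_keys


# Illustrative input: the triple from this issue on the nt side, nothing on the tsv side.
nt_edges = [("UniProtKB:P43220", "biolink:interacts_with", "UniProtKB:Q9BYF1")]
tsv_edges = []

only_in_nt, only_in_tsv = compare_edge_sets(nt_edges, tsv_edges)
print(len(only_in_nt), len(only_in_tsv))
```

If the two outputs really are products of the same merge, both returned sets should be empty for a given release.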
Version
20201001 release
Additional context
Discovered by Tomas Kliegr and group by rule mining on different releases.
Might it be the case that tsv vs rdf is a red herring here?
You are comparing an individual transformed source file with the merged file. It seems more likely that something is happening in the merge step, which may be intentional, e.g. a clique merge.
That is indeed possible. In fact, I am going to close this ticket and move everything to the other one.
Reopening with more info: there is still a .nt vs .tsv difference. I think both should be products of the same clique merging etc.?