metaphacts/semopenalex

merge Geo nodes


Currently each Institution has its own Geo node, so there are a lot:

PREFIX soa: <https://semopenalex.org/ontology/>
select (count(*) as ?c) {
  ?x a soa:Geo
} # 106956

Some queries become more convenient if equivalent nodes are merged,
e.g. "which city has the most publications by institutions located in that city".

If you do #76 and enable owl:sameAs reasoning, the merging will be done automatically because:

<https://semopenalex.org/geo/I200650556> owl:sameAs <https://sws.geonames.org/3149318/>.
<https://semopenalex.org/geo/I1234567890> owl:sameAs <https://sws.geonames.org/3149318/>.

will make the two Geo nodes owl:sameAs each other, since owl:sameAs is symmetric and transitive.
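To gauge how much merging this would cause, a query along the following lines could list the GeoNames IDs shared by more than one Geo node. It is only a sketch: it relies on soa:Geo and gn:geonamesID as used in the other queries in this issue, and the limit of 20 is arbitrary.

```sparql
PREFIX soa: <https://semopenalex.org/ontology/>
PREFIX gn: <http://www.geonames.org/ontology#>
# GeoNames IDs that are attached to more than one Geo node
select ?id (count(?x) as ?n) {
  ?x a soa:Geo ;
     gn:geonamesID ?id
}
group by ?id
having (count(?x) > 1)
order by desc(?n)
limit 20
```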

But there are a couple of problems.

1: Not all Geo nodes have a GeoNames link:

PREFIX soa: <https://semopenalex.org/ontology/>
PREFIX gn: <http://www.geonames.org/ontology#>
select (count(*) as ?c) {
  ?x a soa:Geo
  filter not exists  {?x gn:geonamesID ?id}
} # 4593

2: If the same city appears under two names (e.g. "Washington DC" vs "Washington, D.C.") in two Geo nodes,
then the merged node will end up with two labels, which is not ideal.
It is even worse for wgs:lat and wgs:long, whose values are expected to differ slightly between nodes, so the merged node would carry several coordinates.

So: a more thorough data fusion procedure will be needed.
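As a rough illustration of what such a fusion step might compute, the query below groups Geo nodes by GeoNames ID, picks one label, and averages the coordinates. It is a sketch under assumptions, not a proposed implementation: it assumes labels are stored with rdfs:label and that wgs: is the W3C WGS84 vocabulary; neither is confirmed in this issue.

```sparql
PREFIX soa: <https://semopenalex.org/ontology/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX wgs: <http://www.w3.org/2003/01/geo/wgs84_pos#>   # assumed namespace for wgs:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>     # assumed label property
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# One candidate fused record per GeoNames ID
select ?id
       (sample(?label) as ?name)            # arbitrary pick among the labels
       (avg(xsd:double(?lat)) as ?avgLat)   # cast in case values are not numeric literals
       (avg(xsd:double(?long)) as ?avgLong)
       (count(distinct ?x) as ?nodes) {
  ?x a soa:Geo ;
     gn:geonamesID ?id ;
     wgs:lat ?lat ;
     wgs:long ?long .
  optional { ?x rdfs:label ?label }
}
group by ?id
```

sample() picks a label arbitrarily, so a real procedure would probably prefer a canonical name (for instance the one from GeoNames). A node with several labels is counted several times in the averages, and the Geo nodes without a GeoNames ID are not covered at all.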