- Entity Class: The user can either manually select that the topic entity of the pages is `Movie` or `Film` by searching for it in the ontology schema, or the system can attempt to automatically infer it via distant supervision with existing overlapping data (a rough sketch of this inference is below).
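  A minimal sketch of one reading of that inference, assuming we already have per-page extracted values and per-class value sets from the KB (the voting heuristic and all names here are illustrative assumptions, not a fixed design):

  ```python
  # Vote for the class whose existing KB values overlap most with the
  # values extracted from the pages. Purely illustrative data layout.
  from collections import Counter

  def infer_topic_class(extracted_pages, kb_values_by_class):
      """extracted_pages: list of {attribute: value} dicts, one per page.
      kb_values_by_class: {class_name: set of values already in the KB}."""
      votes = Counter()
      for page in extracted_pages:
          for value in page.values():
              for cls, known in kb_values_by_class.items():
                  if value in known:
                      votes[cls] += 1
      return votes.most_common(1)[0][0] if votes else None

  pages = [{"title": "The Dark Knight", "runtime": "152"},
           {"title": "A Star Is Born", "runtime": "136"}]
  kb = {"Film": {"The Dark Knight", "152"}, "Book": {"A Study in Scarlet"}}
  print(infer_topic_class(pages, kb))  # -> "Film"
  ```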
- Entity Linking: After we establish the topic entity class, individual entities in the extracted data need to be matched with existing data:
  - For example, the extracted entity label may be `The Dark Knight`, but the existing ontology data may have the label `The Dark Knight (film)`. Another example is `A Star is Born`, for which there are both 1976 and 2018 versions.
  - For entities and literal values E_R that are properties of the topic entity E_T, such that (E_T, R, E_R) or (E_R, R, E_T) are triples for some relation R, the goal is to either match E_R to existing entities in the KB or to create new ones, and to match R to existing relations in the KB or to create new ones if needed. A rough sketch of label matching is below.
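    A minimal sketch of fuzzy label matching, assuming `difflib` similarity and an illustrative 0.8 threshold (both are assumptions, not part of these notes):

    ```python
    # Match an extracted entity label against candidate KB labels; if no
    # candidate is similar enough, signal that a new entity is needed.
    from difflib import SequenceMatcher

    def best_kb_match(extracted_label, kb_labels, threshold=0.8):
        def score(a, b):
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()
        best = max(kb_labels, key=lambda label: score(extracted_label, label))
        return best if score(extracted_label, best) >= threshold else None

    print(best_kb_match("The Dark Knight",
                        ["The Dark Knight (film)", "A Star Is Born (1976 film)"]))
    # -> "The Dark Knight (film)"
    ```

    Note that label similarity alone cannot disambiguate the 1976 and 2018 versions of `A Star is Born`; other extracted properties such as the release year would have to break the tie.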
  - There are a few cases to consider:
    - Existing/extracted data matches: The extracted data (roughly) matches a triple already in the KB. In these cases it may be possible to detect the relation R automatically, using the similarity between the existing and extracted data, as well as by comparing the extracted page "relation labels" (if they exist) to the schema. There are many cases where values don't exactly match, with no simple automatic solution (see the normalization sketch after this list):
      - Naming discrepancies: IMDb lists the production company of `The Dark Knight` as the array `["Warner Bros.", "Legendary Entertainment"]`, while RT uses the string `Warner Bros. Pictures/Legendary`. Similarly, IMDb lists Christian Bale's character as `Bruce Wayne`, while RT lists it as `Batman/Bruce Wayne`.
      - Unit differences: IMDb lists the runtime in two formats, `2h 32min` and `152 min`, whereas DBpedia has two relations for runtime: one in minutes (`152`) and the other in seconds (`9120`).
      - Precision differences: IMDb lists the gross box office precisely as `$1,005,456,758`, while DBpedia lists it as `1.005E9`.
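      A sketch of the kind of normalizers that could bridge the unit and precision mismatches above; the parsing rules and the 0.1% tolerance are assumptions, not part of these notes:

      ```python
      # Normalize runtimes and money values so near-matches can be detected.
      import re

      def parse_runtime_minutes(text):
          """Normalize '2h 32min' or '152 min' to integer minutes."""
          m = re.match(r"(?:(\d+)h\s*)?(\d+)\s*min", text)
          return int(m.group(1) or 0) * 60 + int(m.group(2)) if m else None

      def parse_money(text):
          """Normalize '$1,005,456,758' or '1.005E9' to a float."""
          return float(text.replace("$", "").replace(",", ""))

      def roughly_equal(a, b, rel_tol=1e-3):
          """Treat values as matching when they agree within 0.1%."""
          return abs(a - b) <= rel_tol * max(abs(a), abs(b))

      assert parse_runtime_minutes("2h 32min") == parse_runtime_minutes("152 min")
      assert roughly_equal(parse_money("$1,005,456,758"), parse_money("1.005E9"))
      ```

      Naming mismatches like `Warner Bros. Pictures/Legendary` vs. the two-element array are harder; splitting on `/` and fuzzy-matching each part is one heuristic, but a user check is still likely needed.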
    - Extracted values not in KB: These are cases where the extracted E_R values have corresponding attributes in the existing schema, but the existing KB does not contain E_R. A sketch of ranking R candidates follows this list.
      - If the data exists in the KB for other entities, one possibility is to use the global similarity of the extracted values to the existing values to infer R even for pages that have no existing data.
      - If we can guess the class of E_R, we can narrow the options for R from the existing schema and present them to the user.
      - If "relation labels" exist in the extracted data, we can also use their similarity to relations in the schema to rank R candidates.
      - Even with all these heuristics, the user will likely need to be consulted.
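      A sketch combining the class filter and label-similarity ranking; the tiny schema and the use of `difflib` are illustrative assumptions:

      ```python
      # Rank candidate schema relations R for an extracted relation label,
      # optionally narrowed by the guessed class of E_R.
      from difflib import SequenceMatcher

      SCHEMA = [  # (relation, expected class of E_R) -- hypothetical entries
          ("productionCompany", "Organization"),
          ("director", "Person"),
          ("runtimeMinutes", "Number"),
      ]

      def rank_relations(relation_label, guessed_class=None):
          candidates = [r for r, cls in SCHEMA
                        if guessed_class is None or cls == guessed_class]
          def sim(rel):
              return SequenceMatcher(None, relation_label.lower(),
                                     rel.lower()).ratio()
          # Best candidates first; the top few would be shown to the user.
          return sorted(candidates, key=sim, reverse=True)

      print(rank_relations("Production Co")[0])  # -> productionCompany
      ```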
    - Source-specific semantics/data: In some cases, the existing data and new data come from different sources and each must be independently preserved (see the sketch after this list).
      - Source-specific semantics: IMDb and RT use different taxonomies for `genre`. For example, one lists `Action` and `Thriller`, the other lists `Action & Adventure`, and we should keep both since they don't necessarily mean the same thing. Another example is the rating system, which is out of 10.0 for IMDb but out of 100 (and aggregated from critics) for RT.
      - Source-specific data: We should not attempt to merge and match a review on IMDb with a review on RT; they are distinct by nature.
      - Another dimension to consider could be time: the same source might report two different values depending on extraction time, such as Amazon product prices.
      - There is no way to automatically detect such situations unless metadata indicates so (e.g. by marking such relations in the schema as source-specific).
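      A sketch of that metadata idea, where relations flagged as source-specific keep one value per source (and optionally per extraction time) instead of being merged; the dataclass layout and field names are assumptions:

      ```python
      # Triples for flagged relations retain provenance instead of merging.
      from dataclasses import dataclass
      from typing import Optional

      SOURCE_SPECIFIC = {"genre", "rating", "review", "price"}  # schema flags

      @dataclass(frozen=True)
      class Triple:
          subject: str
          relation: str
          value: str
          source: Optional[str] = None        # e.g. "IMDb", "RT"
          extracted_at: Optional[str] = None  # e.g. for fluctuating prices

      def ingest(kb, t):
          if t.relation in SOURCE_SPECIFIC:
              kb.add(t)  # keep IMDb's and RT's values independently
          else:
              # Other relations go through the usual merge/match logic.
              kb.add(Triple(t.subject, t.relation, t.value))

      kb = set()
      ingest(kb, Triple("The Dark Knight", "rating", "9.0/10", "IMDb"))
      ingest(kb, Triple("The Dark Knight", "rating", "94/100", "RT"))
      print(len(kb))  # 2 -- both ratings retained
      ```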
    - Missing semantics in the schema: The user/system must realize when the schema doesn't account for the data.
      - Is there a way to identify this automatically?
      - Complete miss: Need to add a new relation to the schema.
      - Partial miss: An example is where the IMDb website pairs an actor with a character, a release date with a release location, and a writer with the content they contributed. There are a few sub-cases for this (one way to represent the pairing is sketched below):
        - Both relations exist in the schema, but the pairing relationship does not. (How does this generalize when n >= 2?)
        - Only one of the relations currently exists in the schema.
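        One common way to capture such pairings is to reify the pair as an intermediate node; the node and relation names below are illustrative, not an existing schema:

        ```python
        # Reify an actor-character pairing, since a flat (movie, actor)
        # triple cannot also carry the paired character.
        def reify_cast_entry(movie, actor, character):
            node = f"_:cast/{movie}/{actor}"  # blank node for the pairing
            return [
                (movie, "castEntry", node),
                (node, "actor", actor),
                (node, "character", character),
            ]

        for t in reify_cast_entry("The Dark Knight", "Christian Bale",
                                  "Bruce Wayne"):
            print(t)
        ```

        The same pattern extends to n >= 2 by attaching more properties to the intermediate node (e.g. release date plus location, or writer plus contributed content).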
- When looking at a single page, the schema is pretty "flat", i.e. a tree with many branches but low depth.
- Knowledge from one branch is unlikely to transfer to another, because the branches deal with different properties with different semantics, and these semantics don't recur often.
- While a few sub-problems such as inferring entity classes might be possible to automate at scale using the noisy signal of overlapping extracted/existing data (the primary contribution of Luna/Colin's line of work), most of the sub-problems require significant user intervention. For example, only a user with domain expertise can realistically classify whether a mismatch is due to a naming discrepancy or a difference in semantics.