Graph Resolution - zentity 2.0
One of the most popular feature requests I've heard from the community has been to support the resolution of multiple entities and their relations. This issue documents my thoughts on how to implement that in zentity. I call this feature graph resolution because it introduces concepts of graph theory and entails resolving entities and relations in a graph. The feature would be significant enough to warrant the promotion of zentity to version 2.0. The actual implementation may differ from my initial outline below.
Foundational requirements
These are the minimum required capabilities of graph resolution in zentity:
- Generating IDs for entities and relations - zentity must generate unique identifiers (_zid) for entities and relations.
- Modeling relations - zentity must provide a way for users to model relations between entities.
- Resolving relations - zentity must be able to apply relationship models to track relations between entities, and return those in the response of a resolution job.
- Resolving multiple entities in one request - zentity must be able to return multiple entities in the response of a single resolution job.
- Extracting entities from documents - zentity must be able to extract multiple entities from a given document.
- Performing transitive closure - zentity must track the associations between _doc and _zid. Whenever a _doc appears for multiple _zid, zentity must merge the entities of those _zid and their relations.
Optimizations
These are optimizations that can be moved to a subsequent minor version release if needed:
- Scoping graph resolution - zentity should be able to scope resolution jobs by entity type and relation type, in addition to the current accepted scope of attributes, resolvers, and indices.
- Limiting graph traversal - zentity will need a parameter in the resolution job to limit its searches on linked entities by some number of degrees of separation from the entities in the request.
1. Generating IDs for entities and relations
zentity must generate unique identifiers for entities and relations. I will call this identifier a _zid (short for "zentity ID"). The _zid should be a composite value of existing data that together would uniquely identify an entity or relation.
1.1 Entity _zid
Proposed syntax of _zid for entities:
ENTITY_TYPE|ENTITY_INSTANCE|INDEX_NAME|base64(DOC_ID)
Defined as the following:
- ENTITY_TYPE is the name of the entity model.
- ENTITY_INSTANCE is an incrementing counter that differentiates multiple instances of the entity type within a document. I expect this always to be 0 for now, until zentity supports treating nested objects as individual entities.
- INDEX_NAME is the name of the first index in which a document for the entity was found.
- base64(DOC_ID) is the base64-encoded value of the _id of the first document in which the entity was found.
- The values are concatenated in the order listed above with a pipe (|).
Example (using the cross-cluster search syntax for the index name to show why a colon : shouldn't be used as a delimiter for the _zid):
person|0|us:my_index|Mg==
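A minimal Python sketch of how an entity _zid could be built and parsed under the proposed syntax. The function names `make_entity_zid` and `parse_entity_zid` are hypothetical and are not part of zentity; the sketch also assumes the document _id is a UTF-8 string.

```python
import base64


def make_entity_zid(entity_type, index_name, doc_id, instance=0):
    """Build ENTITY_TYPE|ENTITY_INSTANCE|INDEX_NAME|base64(DOC_ID)."""
    # Base64 output never contains a pipe, so the delimiter stays unambiguous
    # even when the raw document _id contains one.
    encoded = base64.b64encode(doc_id.encode("utf-8")).decode("ascii")
    return "|".join([entity_type, str(instance), index_name, encoded])


def parse_entity_zid(zid):
    """Split an entity _zid back into its four components."""
    entity_type, instance, index_name, encoded = zid.split("|")
    doc_id = base64.b64decode(encoded).decode("utf-8")
    return entity_type, int(instance), index_name, doc_id


zid = make_entity_zid("person", "us:my_index", "2")
print(zid)  # person|0|us:my_index|Mg==
```

Because only the document _id can contain a pipe, encoding that one component is enough to make the split on `|` safe.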
Benefits of the proposed syntax of _zid for entities:
- Fast - Concatenating the values is much faster than computing an encoding or a hash digest.
- Intuitive - Key information about the entity is readily apparent in the identifier. This will be useful when viewing the raw data of the relations between entities, as the relationship objects should only display the _zid for each entity for the sake of brevity.
- Safe (de)serialization - The pipe symbol (|) is not allowed in entity names, attribute names, or index names. This means we can safely use it to concatenate the proposed values. A doc _id could contain this symbol, hence the requirement to use base64 encoding of the doc _id to allow for safe usage of the pipe delimiter.
- Deterministic* - zentity performs entity resolution deterministically. If you submit the same request to the Resolution API twice, zentity will query the same indices in the same order. Thus, the proposed method of using the name of the first queried index and the _id of the first returned hit will yield the same _zid, as long as the state of the indices and their documents hasn't changed between those requests (see note below).
*Note - The _zid will NOT always be guaranteed to be the same across multiple responses from the Resolution API. They are ephemeral, and should be used only to uniquely identify the entities and relations of a single resolution request. Persisting these would be in scope of a future enhancement to persist and manage the outputs of entity resolution.
1.2 Relation _zid
Proposed syntax of _zid for relations:
RELATION_TYPE#RELATION_DIRECTION#_ZID_A#_ZID_B
Defined as the following:
- RELATION_TYPE is the name of the relation model (or an empty value).
- RELATION_DIRECTION is the direction of the relation (a>b, a<b, a<>b, or an empty value).
- _ZID_A is the _zid of entity a in the relation.
- _ZID_B is the _zid of entity b in the relation.
- The values are concatenated in the order listed above with a hash (#). A hash is used instead of a pipe (|) because pipes will already appear in _ZID_A and _ZID_B.
Examples:
- residence#a>b#person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation where the type is residence and the direction is a>b.
- residence##person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation where the type is residence and there is no direction.
- #a>b#person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation that has no relation type and the direction is a>b.
- ##person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation that has no relation type and no direction.
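The relation _zid syntax can be sketched the same way. This is an illustration only: `make_relation_zid` and `parse_relation_zid` are hypothetical names, and the sketch assumes that an empty string stands for a missing type or direction, as in the examples above.

```python
def make_relation_zid(zid_a, zid_b, relation_type=None, direction=None):
    """Build RELATION_TYPE#RELATION_DIRECTION#_ZID_A#_ZID_B."""
    # A missing type or direction is written as an empty value.
    return "#".join([relation_type or "", direction or "", zid_a, zid_b])


def parse_relation_zid(zid):
    """Split a relation _zid into (type, direction, _zid of a, _zid of b)."""
    # Entity _zid values never contain a hash, so exactly four parts come back.
    relation_type, direction, zid_a, zid_b = zid.split("#")
    return relation_type or None, direction or None, zid_a, zid_b


person = "person|0|us:my_index|Mg=="
address = "address|0|us:my_index|Mg=="
print(make_relation_zid(person, address, "residence", "a>b"))
print(make_relation_zid(person, address))
```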
Benefits of the proposed syntax of _zid for relations:
- Fast - For the same reasons as entity _zid.
- Intuitive - For the same reasons as entity _zid.
- Safe (de)serialization - For the same reasons as entity _zid. The hash symbol (#) is not allowed in entity names, attribute names, or index names, and it will not appear in the _zid of entities a and b. This means we can safely use it to concatenate the proposed values.
- Deterministic* - For the same reasons as the entity _zid.
2. Modeling relations
zentity must provide a way for users to model the relations between entities as they appear in documents. These relations could be either typed or untyped, and either directional, bidirectional, or undirected. A default relation could be untyped and undirected, representing the co-occurrence of two entities in a document.
Index name for relation models:
.zentity-models-relations
Relation model:
{
  "index": INDEX_NAME,
  "type": RELATION_TYPE,
  "direction": RELATION_DIRECTION,
  "a": ENTITY_TYPE,
  "b": ENTITY_TYPE
}
- "index" (Required) - The name of the index in which the relation appears.
- "type" (Optional) - An arbitrary string that describes the relation between the two entities (e.g. "lives at", "parent of", "child of", "owner of"). Can be null or omitted to represent an untyped relation.
- "a" (Required) - The entity type of one entity in the relation.
- "b" (Required) - The entity type of the other entity in the relation.
- "direction" (Optional) - A string that specifies the direction (or lack thereof) between entities "a" and "b".
  - Directional values: ("a>b", "a<b")
  - Bidirectional values: ("a<>b")
  - Undirected values: null or omitted
  - Uppercase and lowercase should be accepted for these values, but the API handler should lowercase everything before saving the document to the .zentity-models-relations index.
  - Whitespace should be accepted for these values, but the API handler should strip the whitespace before saving the document to the .zentity-models-relations index.
  - The order of "a" or "b" should be accepted either way for these values, but the API handler should sort them before saving the document to the .zentity-models-relations index.
  - Regular expression for accepted values (prior to normalization): ^\s*[abAB]\s*(<+\s*-*|-*\s*>+|<+\s*-*\s*>+)\s*[abAB]\s*$
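The normalization rules for "direction" could look roughly like the following Python sketch. It reuses the regular expression above with capture groups added around the two operands. The function name `normalize_direction` is hypothetical, and rejecting a value whose operands are identical (such as "a>a") is my assumption, since the proposal does not say how to treat it.

```python
import re

# The pattern from the proposal, with capture groups added around each operand.
DIRECTION_RE = re.compile(
    r"^\s*([abAB])\s*(<+\s*-*|-*\s*>+|<+\s*-*\s*>+)\s*([abAB])\s*$"
)


def normalize_direction(value):
    """Return "a>b", "a<b", "a<>b", or None for an undirected relation."""
    if value is None:
        return None
    match = DIRECTION_RE.match(value)
    if match is None:
        raise ValueError("invalid direction: %r" % (value,))
    left, arrow, right = match.group(1).lower(), match.group(2), match.group(3).lower()
    if left == right:
        # Assumption: a direction must mention both "a" and "b".
        raise ValueError("direction must reference both a and b: %r" % (value,))
    points_left, points_right = "<" in arrow, ">" in arrow
    if points_left and points_right:
        return "a<>b"
    # Find the operand the arrow points at, then express it with "a" first.
    target = right if points_right else left
    return "a>b" if target == "b" else "a<b"


print(normalize_direction(" A -> B "))  # a>b
print(normalize_direction("b>a"))       # a<b
```

Sorting the operands this way means "b<a" and "a>b" are stored as the same value.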
3. Resolving multiple entities in one request
Currently, zentity performs entity resolution for a single entity. The request accepts inputs for a single entity, and the response provides data for a single entity.
Graph resolution MUST have the response provide data for one or many entities, and SHOULD have the request allow inputs for multiple entities.
3.1 Resolution API Request
Expected changes that preserve backwards compatibility:
- Requests can express multiple entities as inputs to the resolution job in an "entities" field. If the user doesn't supply an "entities" field, zentity will fall back onto the current syntax for resolution requests.
Expected breaking changes:
- Responses should always contain "entities" and "relations" as top-level fields.
Current syntax for requests:
POST _zentity/resolution/ENTITY_TYPE
{
  "attributes": { ... },
  "terms": [ ... ],
  "ids": { ... },
  "scope": { ... }
}
Current alternative syntax for requests using an embedded entity model:
POST _zentity/resolution
{
  "attributes": { ... },
  "terms": [ ... ],
  "ids": { ... },
  "scope": { ... },
  "model": { ... }
}
Proposed syntax for requests using the "entities" syntax, which supports separately resolving one or many entities in a single resolution job:
POST _zentity/resolution
{
  "entities": [
    {
      "type": ...,
      "attributes": { ... },
      "terms": [ ... ],
      "ids": { ... }
    }
  ],
  "scope": { ... }
}
Proposed alternative syntax for requests using embedded entity models:
POST _zentity/resolution
{
  "entities": [
    {
      "attributes": { ... },
      "terms": [ ... ],
      "ids": { ... },
      "model": { ... }
    }
  ],
  "scope": { ... }
}
When using the "entities" syntax, the values of "scope.*.attributes" and "scope.*.resolvers" must be prefixed with ENTITY_TYPE: (the entity type name followed by a colon).
3.2 Resolution API Response
Proposed syntax for responses:
POST _zentity/resolution
{
  "took" : INTEGER,
  "entities": [
    {
      "_zid": _ZID,
      "_type": ENTITY_TYPE,
      "_hits": [ ... ]
    },
    ...
  ],
  "relations": [
    {
      "_zid": _ZID,
      "_type": RELATION_TYPE,
      "_direction": RELATION_DIRECTION,
      "_a": _ZID,
      "_b": _ZID,
      "_hits": [
        {
          "_index": INDEX_NAME,
          "_id": DOC_ID
        },
        ...
      ]
    },
    ...
  ]
}
The response is a node-link graph structure, where the nodes are listed in the "entities" field and the links are listed in the "relations" field:
- "entities" is a list of objects, where each object is an entity with a unique _zid, a _type, and a list of _hits that retains its current syntax.
- "relations" is a list of objects, where each object is a relation between two entities "a" and "b".
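To illustrate how a client might consume this node-link structure, here is a small Python sketch that turns a response into an adjacency map. The function name `build_adjacency` and the sample values are hypothetical. The sketch assumes that an undirected relation (null or empty direction) is traversable both ways.

```python
def build_adjacency(response):
    """Map each entity _zid to the set of _zid values reachable in one step."""
    adjacency = {entity["_zid"]: set() for entity in response["entities"]}
    for relation in response["relations"]:
        a, b = relation["_a"], relation["_b"]
        direction = relation.get("_direction")
        # Undirected and bidirectional relations are followed in both directions.
        if not direction or direction in ("a<>b", "a>b"):
            adjacency.setdefault(a, set()).add(b)
        if not direction or direction in ("a<>b", "a<b"):
            adjacency.setdefault(b, set()).add(a)
    return adjacency


# Hypothetical response with two entities and one directed relation.
person = "person|0|us:my_index|Mg=="
address = "address|0|us:my_index|Mg=="
response = {
    "took": 3,
    "entities": [
        {"_zid": person, "_type": "person", "_hits": []},
        {"_zid": address, "_type": "address", "_hits": []},
    ],
    "relations": [
        {
            "_zid": "residence#a>b#" + person + "#" + address,
            "_type": "residence",
            "_direction": "a>b",
            "_a": person,
            "_b": address,
            "_hits": [{"_index": "us:my_index", "_id": "2"}],
        }
    ],
}
adjacency = build_adjacency(response)
print(adjacency[person] == {address})  # True
```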
4. Extracting entities from documents
Currently, zentity assumes that everything in a resolution job belongs to a single entity: the attributes for every query submitted to Elasticsearch, and the attributes from every document received from Elasticsearch.
zentity must be able to find all possible entities in the scope of the resolution job. The way it can do this is to check if the document contains non-empty values for every attribute of any resolver for every entity type.
Proposed implementation:
for each doc returned by a query:
  for each entity type in the scope of the job:
    for each resolver in the model of that entity type:
      if the doc contains non-empty values for each attribute in that resolver:
        consider the doc as a hit for that entity type, and use it as input for subsequent queries
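The loop above could be sketched in Python as follows. This is a simplification under stated assumptions: the document is a flat mapping of attribute names to values (the mapping from index fields to attributes is taken as already done), the `models` structure only loosely mirrors an entity model, and the names `has_value` and `extract_entity_types` are hypothetical.

```python
def has_value(doc, attribute):
    """True if the doc holds a non-empty value for the attribute."""
    # 0 and False count as values; None and empty containers do not.
    return doc.get(attribute) not in (None, "", [], {})


def extract_entity_types(doc, models):
    """Return entity types for which at least one resolver is fully populated."""
    matched = []
    for entity_type, model in models.items():
        for resolver in model["resolvers"].values():
            attributes = resolver["attributes"]
            if attributes and all(has_value(doc, name) for name in attributes):
                matched.append(entity_type)
                break  # One satisfied resolver is enough for this entity type.
    return matched


# Hypothetical models and document for illustration.
models = {
    "person": {"resolvers": {
        "name_dob": {"attributes": ["name", "dob"]},
        "name_phone": {"attributes": ["name", "phone"]},
    }},
    "address": {"resolvers": {
        "street_city": {"attributes": ["street", "city"]},
    }},
}
doc = {"name": "Alice Jones", "phone": "555-0100", "street": "123 Main St", "city": ""}
print(extract_entity_types(doc, models))  # ['person']
```

The document above counts as a hit for person through the name_phone resolver, but not for address, because city is empty.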
5. Resolving relations
Relations will be defined by the co-occurrence of two entities in a document. By default, any co-occurrence of multiple entities in a document will create an untyped, undirected relation between each pair of those entities. Sometimes this might not be desired, and so there should be a parameter to disable the creation of relations that aren't described by a user-created relation model.
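As a rough illustration of the default behavior, the following Python sketch creates one untyped, undirected relation for each pair of entities found in the same document. The function name `cooccurrence_relations` is hypothetical, and sorting the pair so that the lower _zid becomes "a" is my assumption, made only to keep the output deterministic.

```python
from itertools import combinations


def cooccurrence_relations(entity_zids, index_name, doc_id):
    """Create an untyped, undirected relation for every pair of entities in a doc."""
    relations = []
    # Sorting is an assumption so that each pair always gets the same a/b order.
    for zid_a, zid_b in combinations(sorted(set(entity_zids)), 2):
        relations.append({
            "_zid": "#".join(["", "", zid_a, zid_b]),  # empty type and direction
            "_type": None,
            "_direction": None,
            "_a": zid_a,
            "_b": zid_b,
            "_hits": [{"_index": index_name, "_id": doc_id}],
        })
    return relations


found = [
    "person|0|my_index|MQ==",
    "address|0|my_index|MQ==",
    "phone|0|my_index|MQ==",
]
for relation in cooccurrence_relations(found, "my_index", "1"):
    print(relation["_zid"])
```

Three co-occurring entities yield three relations, so the number of default relations grows quadratically with the number of entities per document. That growth is one reason a parameter to disable them may be useful.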
6. Performing transitive closure
During the life of the resolution job, it's possible that two or more entities could be discovered to be the same entity (see example below). zentity must merge any entities (and their relations) that share transitive connections.
Example:
- User provides inputs for the attributes of entity A.
- zentity submits queries using the input and receives a document that matches entity A and also contains entity B.
- zentity submits queries using the attributes of entity A and receives a document that matches entity A and contains entity C.
- zentity submits queries using the attributes of entity B and receives a document that matches entity B and contains entity D.
- zentity submits queries using the attributes of entity C and receives a document whose _id was one of entity A and a document whose _id was one of entity B.
In this example, zentity should merge entities A, B, and C, because it was shown that C = B and C = A, therefore A = B = C.
How to check for transitivity
zentity will only know to merge the entities if they share an _id. However, zentity prevents an _id from ever appearing twice, because zentity has an optimization that excludes every _id it discovers from subsequent queries in the job (source). This optimization must be applied for each entity rather than globally for the job. Each entity can have its own _id set to prevent duplicate hits to the same document, while allowing for other entities the chance to overlap with that _id.
Current structure of job.docIDs:
{
  INDEX_NAME: set(DOC_ID, ...),
  ...
}
Proposed new structure of job.docIDs:
{
  _ZID: {
    INDEX_NAME: set(DOC_ID, ...),
    ...
  },
  ...
}
Proposed additional structure to quickly determine if an _id belongs to two or more entities:
{
  DOC_ID: set(_ZID, ...),
  ...
}
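A minimal Python sketch of how the two proposed structures could be maintained together. The class `DocIdTracker` and its methods are hypothetical. One deliberate difference from the outline above, which is my assumption: the reverse lookup is keyed by the pair (INDEX_NAME, DOC_ID) rather than DOC_ID alone, so that equal _id values in different indices are not confused.

```python
from collections import defaultdict


class DocIdTracker:
    """Tracks which documents each entity has seen, and which entities share a doc."""

    def __init__(self):
        # _ZID -> INDEX_NAME -> set(DOC_ID)
        self.by_entity = defaultdict(lambda: defaultdict(set))
        # (INDEX_NAME, DOC_ID) -> set(_ZID); the tuple key is an assumption.
        self.by_doc = defaultdict(set)

    def seen(self, zid, index_name, doc_id):
        """True if this entity already has this document (exclude it from its queries)."""
        return doc_id in self.by_entity.get(zid, {}).get(index_name, set())

    def add(self, zid, index_name, doc_id):
        self.by_entity[zid][index_name].add(doc_id)
        self.by_doc[(index_name, doc_id)].add(zid)

    def shared_docs(self):
        """Documents claimed by two or more entities, i.e. candidates for merging."""
        return {doc: zids for doc, zids in self.by_doc.items() if len(zids) > 1}


tracker = DocIdTracker()
tracker.add("A", "my_index", "1")
tracker.add("B", "my_index", "1")  # A different entity may reach the same doc.
tracker.add("A", "my_index", "2")
print(tracker.seen("A", "my_index", "1"))  # True
print(tracker.seen("B", "my_index", "2"))  # False
```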
When to perform transitive closure
Transitive closure should run just before the job is believed to have ended. This will limit the number of times that this expensive operation has to run. After transitive closure is complete, if any entities were merged, the job should run another hop of queries with the newly merged entities. Otherwise, if no entities were merged, the job is complete.
How to perform transitive closure
At the end of the job, transitive closure should be applied to the _id sets of all entities. Whenever two _id sets share an element, those sets need to be merged, and the attributes of those entities need to be merged. The _zid that is the lexicographically lowest of the merged entities will become the _zid for the newly merged entity.
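One way to implement this merge is with a union-find (disjoint set) structure, sketched below in Python. This is my choice of technique for illustration, not something the outline prescribes. The function name `transitive_closure` is hypothetical, documents are represented as (index, _id) pairs by assumption, and merging of attributes is left out.

```python
def transitive_closure(doc_ids_by_zid):
    """Merge entities whose doc sets overlap.

    doc_ids_by_zid maps each _zid to a set of (index_name, doc_id) pairs.
    Returns a dict keyed by the lexicographically lowest _zid of each merged group.
    """
    parent = {zid: zid for zid in doc_ids_by_zid}

    def find(zid):
        while parent[zid] != zid:
            parent[zid] = parent[parent[zid]]  # Path halving keeps lookups short.
            zid = parent[zid]
        return zid

    def union(zid_a, zid_b):
        root_a, root_b = find(zid_a), find(zid_b)
        if root_a != root_b:
            # Always keep the lexicographically lowest _zid as the group's root.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    owner = {}  # doc -> first _zid seen holding that doc
    for zid, docs in doc_ids_by_zid.items():
        for doc in docs:
            if doc in owner:
                union(owner[doc], zid)
            else:
                owner[doc] = zid

    merged = {}
    for zid, docs in doc_ids_by_zid.items():
        merged.setdefault(find(zid), set()).update(docs)
    return merged


entities = {
    "A": {("idx", "1"), ("idx", "2")},
    "B": {("idx", "2"), ("idx", "3")},
    "C": {("idx", "3")},
    "D": {("idx", "9")},
}
merged = transitive_closure(entities)
print(sorted(merged))  # ['A', 'D']
```

Here A and B share a document, and B and C share another, so all three collapse into A even though A and C never overlap directly.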
7. Scoping graph resolution
zentity should be able to scope resolution jobs by entity type and relation type, in addition to the current accepted scope of attributes, resolvers, and indices.
Current syntax for scoping resolution jobs:
{
  "scope": {
    "exclude": {
      "attributes": { ... },
      "resolvers": [ ... ],
      "indices": [ ... ]
    },
    "include": {
      "attributes": { ... },
      "resolvers": [ ... ],
      "indices": [ ... ]
    }
  }
}
Proposed new syntax for scoping resolution jobs:
{
  "scope": {
    "exclude": {
      "entities": {
        "attributes": { ... },
        "resolvers": [ ... ],
        "types": [ ... ]
      },
      "relations": {
        "types": [ ... ]
      },
      "indices": [ ... ]
    },
    "include": {
      "entities": {
        "attributes": { ... },
        "resolvers": [ ... ],
        "types": [ ... ]
      },
      "relations": {
        "types": [ ... ]
      },
      "indices": [ ... ]
    }
  }
}
8. Limiting graph traversal
zentity will need a parameter in the resolution job to limit its searches on linked entities by some number of degrees of separation from the entities in the request.
Current circuit breaker parameters include:
- max_docs_per_query - Maximum number of docs per query result.
- max_hops - Maximum level of recursion.
- max_time_per_query - Timeout per query.
Proposed changes:
- Add max_degrees and default it to 1.
- Rename max_hops to max_rounds (and rename any other instance of "hop" to "round," because most people will envision a "hop" to mean a link from one entity to another, which isn't the purpose of this parameter).
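To make the intent of max_degrees concrete, here is a Python sketch of a breadth-first traversal that stops expanding once an entity is max_degrees links away from the entities in the request. This is my interpretation of the parameter, not a confirmed design. The function name `entities_within_degrees` and the adjacency map it takes are hypothetical.

```python
from collections import deque


def entities_within_degrees(adjacency, seeds, max_degrees=1):
    """Return {_zid: degrees of separation} for entities within max_degrees of the seeds."""
    degree = {zid: 0 for zid in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if degree[current] >= max_degrees:
            continue  # Do not search further out from entities at the limit.
        for neighbor in adjacency.get(current, ()):
            if neighbor not in degree:
                degree[neighbor] = degree[current] + 1
                queue.append(neighbor)
    return degree


# Hypothetical chain of linked entities: A - B - C - D
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(entities_within_degrees(adjacency, ["A"], max_degrees=1))  # {'A': 0, 'B': 1}
```

With the proposed default of 1, only entities directly linked to the requested entities would be resolved.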
Can we use this feature in the future release of the plugin as it seems quite helpful for graph resolution?