This repository contains the Python code to replicate the experiments for NEST. We propose neural models for type prediction and type representation that improve the type enrichment strategies used in existing matching pipelines, in a modular fashion. In particular:
- type enrichment for type-based filtering: neural type prediction algorithms enrich the types of candidate entities with types predicted by a neural network.
- type enrichment for entity similarity with distributed representations: distributed type representations enrich entity embeddings so that their similarity takes types into account (see the sketch after this list).
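As a rough illustration of the second strategy, the following sketch enriches entity embeddings with distributed type representations (e.g., from TEE or OWL2Vec) by concatenation before computing cosine similarity. The function name and the concatenation scheme are illustrative assumptions, not the repository's actual implementation.

import numpy as np

def type_aware_similarity(ent_a, ent_b, type_a, type_b):
    # Concatenate each entity embedding with its type embedding, then
    # compute the cosine similarity of the enriched vectors.
    # All arguments are 1-D numpy arrays.
    a = np.concatenate([ent_a, type_a])
    b = np.concatenate([ent_b, type_b])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))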
This work is under review (ESWC 2021):
Cutrona, V., Puleri, G., Bianchi, F., and Palmonari, M. (2020). NEST: Neural Soft Type Constraints to Improve Entity Linking in Tables. ESWC 2021 (under review).
The code is developed for Python 3.8.
Install all the required packages listed in the requirements.txt file.
virtualenv -p python3.8 venv  # we suggest creating a virtual environment
source venv/bin/activate
pip install -r requirements.txt
Neural networks and type embeddings are available in the utils_data.zip file. The following files must be extracted under the utils/data directory:
- abs2vec_pred.keras and abs2vec_pred_classes.pkl: the neural network based on BERT embeddings, and the list of its predictable classes
- rdf2vec_pred.keras and rdf2vec_pred_classes.pkl: the neural network based on RDF2Vec embeddings, and the list of its predictable classes
- dbpedia_owl2vec: typed embeddings for DBpedia 2016-10, generated using OWL2Vec
- tee.wv: typed embeddings for DBpedia 2016-10, generated using TEE
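As an example, a predictor can be loaded and applied as follows. This is a minimal sketch: it assumes the files above have been extracted and that the model accepts a batch of entity embeddings; the dummy input is illustrative only.

import pickle
import numpy as np
from tensorflow import keras

# Load the RDF2Vec-based type predictor and the list of classes it can predict.
model = keras.models.load_model("utils/data/rdf2vec_pred.keras")
with open("utils/data/rdf2vec_pred_classes.pkl", "rb") as f:
    classes = pickle.load(f)

# Predict types for a dummy batch of entity embeddings.
entity_vectors = np.random.rand(1, model.input_shape[-1]).astype("float32")
scores = model.predict(entity_vectors)
predicted_types = [classes[i] for i in scores.argmax(axis=1)]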
We release a set of Docker images to run the above predictors as a service; some other embedding models (e.g., RDF2Vec) are also exposed as services. Download the abs2vec embeddings from GDrive and set their path in the docker-compose.yml file.
Finally, start the containers:
docker-compose up -d
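Once the containers are up, the predictors can be queried over HTTP. The endpoint, port, and payload in the sketch below are hypothetical placeholders, not the images' documented API; check the service code for the actual contract.

import requests

# Hypothetical endpoint and payload; the actual API may differ.
resp = requests.post(
    "http://localhost:5000/predict",
    json={"entities": ["http://dbpedia.org/resource/Rome"]},
)
print(resp.json())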
Benchmark datasets can be downloaded from GDrive. Unzip the file under the datasets folder.
Replicating our experiments requires initializing an Elasticsearch index that contains DBpedia 2016-10. We created ours using ElasticPedia, then manually added the Wikipedia anchor texts, the labels from the Lexicalization dataset, and the in- and out-degrees from the Page Links dataset. Lastly, we re-indexed it with the following mappings:
{
"dbpedia": {
"mappings": {
"properties": {
"category": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"description": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"direct_type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"in_degree": {
"type": "integer"
},
"nested_surface_form": {
"type": "nested",
"properties": {
"surface_form_keyword": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
},
"out_degree": {
"type": "integer"
},
"surface_form_keyword": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"uri": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"uri_count": {
"type": "integer"
},
"uri_prob": {
"type": "float"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "3"
}
}
}
}
}
}
}
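For reference, an index with these mappings can be created via the official Python client. The sketch below assumes elasticsearch-py 8.x, a cluster at localhost:9200, and that the document above is saved as mapping.json; all three are assumptions.

import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

with open("mapping.json") as f:
    conf = json.load(f)["dbpedia"]  # drop the index-name wrapper shown above

es.indices.create(
    index="dbpedia",
    mappings=conf["mappings"],
    settings=conf["settings"],
)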
We are planning to release a dump of our index.
Replace the host name titan with the endpoint of your Elasticsearch index in the following files:
- utils/nn.py
- utils/embeddings.py
- data_model/kgs.py
- run_experiments.py
Run the script as follows to initialize and run the models described in our paper:
python run_experiments.py
Results are printed in the eswc_experiments.json file.
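The results file can then be inspected, for instance as follows; the internal structure of the JSON is not fixed here, so the sketch simply pretty-prints it.

import json

with open("eswc_experiments.json") as f:
    results = json.load(f)

print(json.dumps(results, indent=2))  # pretty-print all recorded metrics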
- Vincenzo Cutrona, University of Milano - Bicocca (vincenzo.cutrona@unimib.it)
- Gianluca Puleri, University of Milano - Bicocca (gianluca.puleri@unimib.it)
- Federico Bianchi, Bocconi University (f.bianchi@unibocconi.it)
- Matteo Palmonari, University of Milano - Bicocca (matteo.palmonari@unimib.it)