/data-set-knowledge-graph

code for generating a high-quality knowledge graph with metadata about datasets and links to publications

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Data Set Knowledge Graph (DSKG)

Abstract

We present an approach for constructing an RDF knowledge graph for datasets. To build the knowledge graph, we use datasets registered in OpenAIRE and Wikidata. We identify all publications out of 146 million scientific publications which contain mentions of datasets, and establish links between the dataset and publication representations in the Microsoft Academic Knowledge Graph. As the author names of datasets can be ambiguous, we develop and evaluate a method for author name disambiguation and enrich the knowledge graph with links to ORCID. Overall, our knowledge graph contains 2,208 datasets with associated properties, as well as 813,551 links to scientific publications. It can be used for a variety of scenarios, facilitating advanced dataset search systems and new ways of measuring and awarding the provisioning of datasets. The constructed data set knowledge graph (DSKG) is provided with a SPARQL endpoint and resolvable URIs at http://dskg.org and is also available at Zenodo.

Schema of the DSKG

The repository provides all the scripts needed to create the knowledge graph semi-automatically. The following manual explains how to create the knowledge graph.

Knowledge Graph Construction

We use the following two sources of dataset metadata to create the DSKG:

  1. OpenAIRE-Dataset: We consider a subset of the OpenAIRE Research Graph dump which contains metadata about datasets. The dump we use is created with this code: https://github.com/michaelfaerber/OpenAIRE.
  2. Wikidata-Dataset: We use instances of the Wikidata classes which represent datasets. The instances of the relevant classes and their properties can be retrieved with semantic queries via the publicly available Wikidata SPARQL endpoint.
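As an illustration of the second source, the following sketch builds a SPARQL query for instances of a dataset class and the corresponding request URL for the Wikidata endpoint. It is not the query used for the DSKG: the class ID `Q1172284`, the selected properties (`P856`, `P2699`), and the function names are assumptions chosen for the example.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_dataset_query(class_qid: str = "Q1172284", limit: int = 100) -> str:
    """Return a SELECT query for instances of one Wikidata class.

    class_qid is an assumption for this sketch; the DSKG may use
    several classes that represent datasets.
    """
    return f"""
SELECT ?item ?itemLabel ?altLabel ?officialWebsite ?url WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  OPTIONAL {{ ?item skos:altLabel ?altLabel . FILTER(LANG(?altLabel) = "en") }}
  OPTIONAL {{ ?item wdt:P856 ?officialWebsite . }}
  OPTIONAL {{ ?item wdt:P2699 ?url . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL that asks for JSON results."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    print(build_request_url(build_dataset_query(limit=5)))
```

The URL can be fetched with any HTTP client; the sketch stops before the network call so that it runs offline.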

Identify publications from the MAG which contain mentions of datasets

We use a string-based algorithm to detect mentions of datasets in papers. For the matching, we use the files of the MAG dump that contain the paper abstracts and citation contexts. For datasets from OpenAIRE, we use the following metadata fields to recognize a dataset in these files: title, originalId and doi. For datasets from Wikidata, we use: itemLabel, altLabel, officialWebsite, workURL and url.

  1. The first step is to exclude the most frequently used English words from the matching. The following script computes the set of words that are not considered: match_text_corpus.py.
  2. Then run the script for the matching. The MAG dumps are used as input, and the output consists of text files with the matches found: string_based_matching_MAKG.py.
  3. After that, filter the results using the created filter list to reduce false matches: filterwords_matching.py.
  4. The following script inserts the found matches into the initial csv-files of the OpenAIRE and Wikidata datasets: MAKG_links_in_csv.py. In the following, we only use the metadata records for which at least one link could be found.
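The matching and filtering steps above can be sketched as follows. This is a minimal illustration of string-based matching with a filter list, not the code in the scripts named above; the function names and the example data are hypothetical.

```python
import re


def build_filter_list(dataset_names, common_words):
    """Return the dataset names that coincide with frequent English words."""
    common = {w.lower() for w in common_words}
    return {name for name in dataset_names if name.lower() in common}


def find_mentions(text, dataset_names, filter_list=frozenset()):
    """Return the dataset names that occur in text as whole words or phrases.

    Names on the filter list are skipped to reduce false matches.
    """
    hits = []
    for name in dataset_names:
        if name in filter_list:
            continue
        # Require a non-word character (or the text boundary) on both sides.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(name)
    return hits


if __name__ == "__main__":
    names = ["ImageNet", "Iris", "Time"]
    filter_list = build_filter_list(names, ["time", "the"])
    abstract = "We train on ImageNet and test at run time."
    print(find_mentions(abstract, names, filter_list))
```

Without the filter list, the dataset named "Time" would match the ordinary word "time" in the example abstract, which is the kind of false match the filter step is meant to remove.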

Transform tabular metadata to RDF and assign URIs for entities

We implemented the data transformation of the original metadata using SPARQL CONSTRUCT and SPARQL INSERT queries in Ontotext's GraphDB graph database.

  1. Clean up the OpenAIRE dataset (csv-file) entries and adapt the metadata entries of the size property to DCAT: preprocessing_OpenAIRE.py.
  2. Perform the classification of the metadata entries for OpenAIRE and Wikidata according to DCAT: classification_resources.py.
  3. In GraphDB: Create a beta version of the DSKG in which the properties are mapped to DCAT but no URIs are assigned to the resources yet. The dskg-beta-version is created with SPARQL CONSTRUCT and INSERT queries (SPARQL_CONSTRUCT_openAIRE_beta_version.txt and SPARQL_CONSTRUCT_wikidata_beta_version.txt) over the OpenAIRE and Wikidata datasets in tabular form (csv-files). For the further steps, the beta version of the DSKG is exported in tabular form using a SPARQL SELECT query in GraphDB.
  4. Use the file PaperFieldsOfStudy.txt from the MAG-dump, the dskg-beta-version and the Jupyter Notebook fields_of_application.ipynb to determine the fields of application of the datasets and add them to the dskg-beta-version.
  5. Perform the author disambiguation explained in the paragraph below.
  6. Assign unique URIs to the entities in the dskg-beta-version (uses the results of the author disambiguation): assign_uris_for_entities.py.
  7. Load the enriched information from the dskg-beta-version into the classified OpenAIRE and Wikidata datasets for the final construction of the knowledge graph. Create csv-files from the dskg-beta-version for each class of entities in the metadata: final_csv_files_transformation_dcat.ipynb.
  8. Load the generated csv-files into a GraphDB repository and transform the tabular data into RDF using the SPARQL CONSTRUCT and SPARQL INSERT queries to construct the final DSKG.
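To make the idea of the last step concrete, the sketch below turns rows of a csv-file into N-Triples with plain Python. The repository performs this transformation with SPARQL queries in GraphDB, so this is a swapped-in technique for illustration only; the namespace `http://dskg.org/entity/`, the column names `id` and `title`, and the function names are assumptions.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://dskg.org/entity/"  # hypothetical URI namespace for this sketch
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"
DCT_TITLE = "http://purl.org/dc/terms/title"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rows_to_ntriples(csv_text: str) -> list:
    """Emit a type triple and a title triple for every row of the table.

    Assumes the columns 'id' and 'title' exist (hypothetical schema).
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = BASE + quote(row["id"], safe="")
        lines.append(f"<{uri}> <{RDF_TYPE}> <{DCAT_DATASET}> .")
        lines.append(f'<{uri}> <{DCT_TITLE}> "{escape_literal(row["title"])}" .')
    return lines


if __name__ == "__main__":
    table = "id,title\nds 1,Iris flower data\n"
    print("\n".join(rows_to_ntriples(table)))
```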

Note on using SPARQL CONSTRUCT and INSERT queries in GraphDB: The SPARQL INSERT queries are identical to the CONSTRUCT queries except for three changes: the keyword is replaced (INSERT instead of CONSTRUCT), the LIMIT 100 restriction is removed, and the corresponding SPARQL endpoint is added within the WHERE clause: WHERE { SERVICE <ontorefine:99999999999> {...} }. Here, <ontorefine:99999999999> is an example of a SPARQL endpoint in GraphDB.
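The three differences described in this note can be applied mechanically. The following sketch is one possible helper, not part of the repository; it assumes a query with a single top-level WHERE block whose closing brace is the last `}` in the text, and a `LIMIT 100` at the very end. The function name is hypothetical.

```python
import re


def construct_to_insert(query: str, service: str = "ontorefine:99999999999") -> str:
    """Rewrite a CONSTRUCT query into the INSERT form described in the note."""
    # 1. Replace the keyword (first occurrence only).
    out = re.sub(r"\bCONSTRUCT\b", "INSERT", query, count=1, flags=re.IGNORECASE)
    # 2. Remove the trailing LIMIT 100 restriction.
    out = re.sub(r"\s*\bLIMIT\s+100\s*$", "", out, flags=re.IGNORECASE)
    # 3. Wrap the body of the WHERE clause in a SERVICE block.
    match = re.search(r"\bWHERE\s*\{", out, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("query has no WHERE clause")
    open_brace = match.end() - 1
    close_brace = out.rindex("}")
    body = out[open_brace + 1:close_brace]
    return (
        out[:open_brace]
        + "{ SERVICE <" + service + "> {" + body + "} }"
        + out[close_brace + 1:]
    )


if __name__ == "__main__":
    q = ("CONSTRUCT { ?s a <http://www.w3.org/ns/dcat#Dataset> } "
         "WHERE { ?s ?p ?o } LIMIT 100")
    print(construct_to_insert(q))
```

The service identifier must be replaced with the endpoint of the actual project in GraphDB; the default value is only the example from the note.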

Author Disambiguation

  1. Perform a SPARQL Query over the dskg-beta-version to get a table with the relevant information of the datasets for the LDA model.
  2. Calculate the LDA vectors for the datasets and load them into the dskg-beta-version for the author disambiguation with the Jupyter Notebook LDA-Modell.ipynb.
  3. Perform the author disambiguation with the Jupyter Notebook author_disambiguation.ipynb. Use the dskg-beta-version from the LDA model as input. The code first creates a txt-file that contains all the information necessary for the author disambiguation and then uses this file to perform the disambiguation.
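To give a rough idea of how topic vectors can help separate authors with the same name, the sketch below groups author mentions that share a normalized name and whose vectors are similar. The greedy clustering, the cosine threshold of 0.8, and all names are assumptions for illustration; the method actually implemented in author_disambiguation.ipynb may differ.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def disambiguate(mentions, threshold=0.8):
    """Map each mention ID to a cluster ID.

    mentions: iterable of (mention_id, author_name, topic_vector).
    Mentions are merged only if the names match and the topic vector is
    similar to the first vector of an existing cluster (greedy assignment).
    """
    by_name = defaultdict(list)
    for mention_id, name, vector in mentions:
        by_name[name.strip().lower()].append((mention_id, vector))

    assignment = {}
    next_cluster = 0
    for items in by_name.values():
        clusters = []  # (cluster_id, representative_vector) for this name
        for mention_id, vector in items:
            for cluster_id, representative in clusters:
                if cosine(vector, representative) >= threshold:
                    assignment[mention_id] = cluster_id
                    break
            else:
                clusters.append((next_cluster, vector))
                assignment[mention_id] = next_cluster
                next_cluster += 1
    return assignment


if __name__ == "__main__":
    demo = [
        ("m1", "J. Smith", [0.9, 0.1]),
        ("m2", "j. smith", [0.85, 0.15]),
        ("m3", "J. Smith", [0.1, 0.9]),
    ]
    print(disambiguate(demo))
```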

Linking the authors of the DSKG to ORCID

  1. Perform a SPARQL Query over the DSKG to get a table (csv-file) with the author profiles from the knowledge graph.
  2. Query the titles of the linked papers using the MAKG SPARQL endpoint: 02MAKG_paper_titels.py.
  3. Query the author names via the ORCID API: 03ORCID_API.py.
  4. Perform the linking to ORCID by running the script that compares the author profiles: 04linking_authors_to_orcid.py.
  5. Insert the found ORCID IDs of the authors into the csv-file which contains the metadata of the authors: 05add_ORCID_IDs_to_csv.py.
  6. Add the ORCID-IDs to the knowledge graph in GraphDB using SPARQL CONSTRUCT and SPARQL INSERT.
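The profile comparison in step 4 can be pictured as follows: for an author in the DSKG, pick the ORCID candidate that shares the most paper titles. This is a simplified sketch under that assumption and not the logic of 04linking_authors_to_orcid.py; the function names are hypothetical and the ORCID IDs in the example are placeholders.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and reduce it to its alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def link_author(author_titles, candidates, min_shared=1):
    """Return the ORCID ID whose works overlap most with the author's papers.

    candidates: dict mapping an ORCID ID to a list of work titles.
    Returns None if no candidate shares at least min_shared titles.
    """
    own = {normalize_title(t) for t in author_titles}
    best_id, best_shared = None, 0
    for orcid_id, titles in candidates.items():
        shared = len(own & {normalize_title(t) for t in titles})
        if shared > best_shared:
            best_id, best_shared = orcid_id, shared
    return best_id if best_shared >= min_shared else None


if __name__ == "__main__":
    papers = ["The Data Set Knowledge Graph: Creating a Linked Open Data Source"]
    candidates = {
        "0000-0000-0000-0001": ["Unrelated work"],  # placeholder ID
        "0000-0000-0000-0002": [  # placeholder ID
            "the data set knowledge graph - creating a linked open data source"
        ],
    }
    print(link_author(papers, candidates))
```

Requiring at least one shared title keeps authors unlinked when only the name matches, which trades coverage for fewer wrong links.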

Demo

See http://dskg.org/.

Contact

The system has been designed and implemented by Michael Färber and David Lamprecht. Feel free to reach out to us:

Michael Färber, michael.faerber@kit.edu

How to Cite

Please cite our paper as follows:

@article{Faerber2021DSKG,
  author    = {Michael F{\"{a}}rber and
               David Lamprecht},
  title     = "{The data set knowledge graph: Creating a linked open data source for data sets}",
  journal   = {Quantitative Science Studies}, 
  publisher = {MIT Press}, 
  volume    = {2},
  number    = {4},
  pages     = {1324-1355},
  year      = {2021},
  issn      = {2641-3337},
  doi       = {10.1162/qss_a_00161},
  url       = {https://doi.org/10.1162/qss\_a\_00161}
}