DBpedia Graph Extractor

A simple tool for extracting labeled graphs from DBpedia. It is intended for testing machine-learning algorithms.

Graph Structure

The graph is created by specifying a list of DBpedia ontology types. Any DBpedia resource belonging to one of the specified types will become a node in the graph. (Directed) edges are defined by Wikipedia page links between corresponding DBpedia resources.
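
To make this definition concrete, here is a minimal sketch of the same construction on a few hand-written triples. It is not the repository's process_data.awk. The sample data, the temporary file names, and the assumption that the inputs are N-Triples lines with the type (or link target) in the third field are illustrative only.

```bash
#!/usr/bin/env bash
# Toy illustration of the graph definition; NOT the repository's awk script.
# All file names and sample triples are made up for this example.
set -euo pipefail

workdir=$(mktemp -d)

# Ontology types that define the node set (one per line).
cat > "$workdir/types" <<'EOF'
<http://dbpedia.org/ontology/Country>
<http://dbpedia.org/ontology/City>
EOF

# Sample "resource has type" triples.
cat > "$workdir/instance_types.nt" <<'EOF'
<http://dbpedia.org/resource/Germany> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Berlin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/City> .
<http://dbpedia.org/resource/Angela_Merkel> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
EOF

# Sample page-link triples.
cat > "$workdir/page_links.nt" <<'EOF'
<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Germany> .
<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Angela_Merkel> .
<http://dbpedia.org/resource/Germany> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Berlin> .
EOF

edges=$(awk '
  # Pass 1: remember the requested ontology types.
  FILENAME == ARGV[1] { wanted[$1] = 1; next }
  # Pass 2: a resource with a requested type becomes a node with a fresh id.
  FILENAME == ARGV[2] { if ($3 in wanted && !($1 in id)) id[$1] = ++n; next }
  # Pass 3: keep a page link only if both endpoints are nodes.
  ($1 in id) && ($3 in id) { print id[$1], id[$3] }
' "$workdir/types" "$workdir/instance_types.nt" "$workdir/page_links.nt")

echo "$edges"
rm -rf "$workdir"
```

In this toy run Germany receives id 1 and Berlin id 2, so the script prints the directed edges `2 1` and `1 2`. The link from Berlin to Angela_Merkel is dropped because Person is not among the requested types.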

Files

This repository is organized into two folders:

raw/: Primarily used as a store for files downloaded from DBpedia. Contains a script for downloading required raw data files from DBpedia (download_files.sh).
scripts/: Contains scripts for processing the downloaded DBpedia files to extract the desired graph:
- process_data.sh: A bash script that downloads any missing files from DBpedia, decompresses them via Unix named pipes, and calls the awk script below.
- process_data.awk: An awk script to process the DBpedia files and extract the desired graph. All the actual data processing occurs here.
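
The named-pipe approach mentioned above can be sketched as follows. This is a generic illustration of the technique, not an excerpt from process_data.sh. The file names are invented, and gzip is used only so that the example runs on a tiny sample it creates itself; the real downloads may need a different decompressor.

```bash
#!/usr/bin/env bash
# Sketch of streaming decompression through a named pipe.
# File names are hypothetical; gzip stands in for whatever decompressor
# the real files require.
set -euo pipefail

workdir=$(mktemp -d)

# A small compressed sample standing in for a downloaded dump.
printf 'a b c .\nd e f .\n' | gzip -c > "$workdir/sample.nt.gz"

# Create the named pipe and decompress into it in the background.
mkfifo "$workdir/sample.pipe"
gzip -dc "$workdir/sample.nt.gz" > "$workdir/sample.pipe" &

# The consumer opens the pipe like a regular file, so the
# uncompressed data is never written to disk.
line_count=$(awk 'END { print NR }' "$workdir/sample.pipe")

wait  # reap the background decompressor
echo "lines read: $line_count"
rm -rf "$workdir"
```

One reason to prefer this pattern for large dumps is that the consumer sees ordinary file names while the disk only ever holds the compressed copies.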

Usage

The process of creating a new graph is simple:

Create a file containing the DBpedia ontology types desired. An example has been provided in ontology_types.example:
```
 <http://dbpedia.org/ontology/AdministrativeRegion>
 <http://dbpedia.org/ontology/Country>
 <http://dbpedia.org/ontology/City>
 <http://dbpedia.org/ontology/Town>
 <http://dbpedia.org/ontology/Village>
```
These correspond to the types used to create the "populated places" datasets in the following paper:

Neumann, M., Garnett, R., and Kersting, K. Coinciding Walk Kernels: Parallel Absorbing Random Walks for Learning with Graphs and Few Labels. (2013). To appear in: Proceedings of the 5th Annual Asian Conference on Machine Learning (ACML 2013).
Open scripts/process_data.sh and, if desired, change the following variables (defined at the top of the file):
- PROCESSED_DIRECTORY: Where to store the created graph.
- ONTOLOGY_TYPES_FILE: A list of the DBpedia ontology types to use.
Run the scripts/process_data.sh script.
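
Before pointing ONTOLOGY_TYPES_FILE at a custom file, it can help to check that every line has the same shape as the entries in ontology_types.example. The helper below is a hypothetical sketch, not part of the repository, and it assumes that every type is a bracketed URI in the http://dbpedia.org/ontology/ namespace.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): count non-blank lines
# that are not of the form <http://dbpedia.org/ontology/Name>.
count_malformed_types() {
  awk 'NF && $1 !~ /^<http:\/\/dbpedia\.org\/ontology\/[^>]+>$/ { bad++ }
       END { print bad + 0 }' "$1"
}

# Write a custom types file (a temporary file is used here for the demo).
types_file=$(mktemp)
cat > "$types_file" <<'EOF'
<http://dbpedia.org/ontology/Country>
<http://dbpedia.org/ontology/City>
EOF

malformed=$(count_malformed_types "$types_file")
echo "malformed lines: $malformed"
```

If the count is zero, the file can be referenced from ONTOLOGY_TYPES_FILE and scripts/process_data.sh run as described in the steps above.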

Output

The scripts/process_data.awk file will output four files:

edge_list: A list of edges corresponding to Wikipedia page links between the extracted nodes.
Format: [from node id] [to node id]

labels: A list of integer labels associated with the extracted nodes. Line i of this file is the label of the node with id i.

label_ids_to_labels: A map from created integer label ids to the provided ontology types.
Format: [label id] [ontology type name]

node_ids_to_names: A map from created node ids to the corresponding DBpedia resource names.
Format: [node id] [DBpedia resource name]
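
As a quick check on a finished extraction, the output files can be summarized with a few lines of awk. The sketch below builds tiny stand-in files with made-up contents and assumes that fields are separated by whitespace, as the formats above suggest. For real data, point the awk commands at the files in the configured output directory instead.

```bash
#!/usr/bin/env bash
set -euo pipefail

outdir=$(mktemp -d)

# Tiny stand-ins for the real output files (contents are made up).
printf '1\n2\n2\n' > "$outdir/labels"
printf '%s\n' \
  '1 <http://dbpedia.org/ontology/Country>' \
  '2 <http://dbpedia.org/ontology/City>' > "$outdir/label_ids_to_labels"
printf '2 1\n3 1\n1 2\n' > "$outdir/edge_list"

# Number of nodes per ontology type: map label ids to type names,
# then count how often each label id occurs in the labels file.
label_counts=$(awk '
  FILENAME == ARGV[1] { name[$1] = $2; next }
  { count[$1]++ }
  END { for (id in count) print name[id], count[id] }
' "$outdir/label_ids_to_labels" "$outdir/labels" | sort)

# Out-degree of every node that has at least one outgoing edge.
out_degrees=$(awk '{ deg[$1]++ } END { for (n in deg) print n, deg[n] }' \
  "$outdir/edge_list" | sort -n)

echo "$label_counts"
echo "$out_degrees"
rm -rf "$outdir"
```

With the stand-in data this reports two City nodes and one Country node, and an out-degree of 1 for each of the three nodes.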

Notes

The raw/download_files.sh script requires curl.
By default, the 3.9 release of DBpedia will be used (current as of this release). If desired, this can be overridden by modifying the DBPEDIA_VERSION variable in raw/download_files.sh.
The required DBpedia files will be downloaded to the raw directory. These files are not deleted by default, to enable the creation of multiple graphs without having to download the files again. These files are rather large, however, and you might want to remove them when you're done.
The DBpedia datasets used are dual-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC BY-SA 3.0) and the GNU Free Documentation License. Using this tool will create derivative works that are subject to the conditions of the license of your choice. The code itself is licensed under the MIT license (see LICENSE for the full text).
