NumTab
The following code is used for the creation of a dataset focusing on Numerical Values for, among other things, disambiguation of CSV tables to RDF. This version of NumTab was used to evaluate DSL and MSL.
We base our approach on a given number of properties from DBpedia and disambiguate those to Wikidata properties, semi-automatically, a detailed descritption of that process is given in the [Mapping section](## Mapping). We use values from Wikidata and keep the DBpedia properties. We limit the number of bags of values (columns) per property to a maximum of twenty.
The created files can be found in gold/numtab.txt, where each DBpedia property is followed by the number of rows of numerical values and the respective Wikidata values. In gold/files/files-csv each bag of values gets a CSV file with an abritary ID and the property URI, encoded with + for each / in the URI. The files contain only the values. In those files, values for the property routeNumber
are missing since they mix numerical and textual values.
Specifications of NumTab
Metrics | |
---|---|
Properties used | 22 |
Max occurrences/property | 20 |
Total bag of values | 224 |
Median values/bag | 216 |
Min values/bag | 5 |
Min values/bag | 4552 |
Used Properties and occurence
Using the code
Mapping and generation of Numtab can be run independently.
Requirements
- Python 2.7.12
- SPARQLWrapper and pandas available via pip
- The dump used for the dataset included in the repo can be found here
Useful links
- Source code for DSL is available here along with the datasets used for evaluation
- Source code for MSL is available here
Mapping
The mapping takes approximately 80 GB of memory (since the dump is loaded in memory) and 10 GB for the file created.
In order to do the mapping more efficently, we create a dump that's purely numerical with the script mapping/numerical-dump.py. You will need to download a .nt.bz2 file, containing a Wikidata dump provided by Wikimedia. The created text file will contain a python dictonary with a version of the triples, that have numerical objects. Based on that mapping/map-properties.py maps DBpedia properties to Wikidata, assumes Wikidata and DBpedia share a set of triples. It reads from properties.csv and writes the mapping to wdproperties.py
tl;dr
- Download Wikidata *-truthy-BETA.nt.bz2 * dump from here
- run numerical-dump.py (careful, will eat lots of memory)
- run map-properties.py (careful, will eat lots of memory)
- currently writes to properties/wd-properties.py and properties/log.txt in a very bad format, adjust this file so it looks like properties-unambigious.txt (we used the log file to adjust and count most occurences etc.)
Generate dataset
make-dataset.py creates the dataset. Using wd-query.py it calls the Wikidata endpoint to get classes used by items using the Wikidata properties in wdproperties.py. The resulting triples are passed to data-structure.py, which sorts them by property-object pairs to limit the possible triples to 50. create-json.py will write the resulting dataset and its annotation to a json file that can be used.
tl;dr
- make sure properties/properties-unambiguous.txt exists
- run make-dataset.py
Mapping
NumTab utilizes the fact that many DBpedia items are connected to Wikidata by owl:sameAs
. However, the properties of those two ontologies are only partially connected. We chose to connect the properties with a very simplistic approach: For a given set of properties we check which triples have the same subject and object in both Wikidata and DBpedia. To test this, we chose 46 properties provided in properties/properties.csv.
There are quite a few limitations to map properties like this. Usually, the scope of Wikidata properties is much broader. For example, the DBpedia properties foundingYear
, yearOfConstruction
, openingYear
, and formationYear
can all be summarized by the Wikidata property inception (P571
). Very broad properties such as year
in DBpedia do not have an equivalent in Wikidata. Beside being only able to look for values that have a subject that is disambiguated to Wikidata in DBpedia, another limitation is the way how statements in Wikidata are formed compared to Wikidata. Those rather broad properties in Wikidata are usually limited by so called qualifier, that make a statement valid for a certain condition. So can for example multiple population numbers for a city be limited in their validation to a certain time frame (using start time (P580
) and end time (P582
). Another example is the DBpedia property service start year
. While DBpedia describes such cases with a dedicated property, Wikidata makes use of its properties to describe the enrolment in a service and qualifies it with an end and start time.
Building the dataset
To build NumTab, we used the automatic mappings between Wikidata and DBpedia, enriched by manual mappings, which can be found in properties/properties-all.
We limited them to only those properties that we could unambiguously match to DBpedia. Therefore we base our work on those P2048
) and are of the same instance of (P31
), such as sculpture (Q860861
). We look into more connections.
That means, we cluster all triples, that use the given properties by their types. We limit the classes, to such that have between 100 and 5000 instances using Wikidata's SPARQL endpoint and select 20 of those clusters. For each cluster, we look at the most common property-object pair. In the previous mentioned example of sculptures, that could be all sculptures (Q860861
) that are part of the movement (P135
) Art Nouveau (Q34636
). From all items sharing the most common property-object pair, we extract the numerical value for the one of our 17 properties and add them to a column labelled by the corresponding DBpedia property.
This dataset NumTab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.