The following two points made me create this document:
- RDF is everywhere: large resources such as WikiData and DBPedia, built by refining Wikipedia, are very useful for NLP research.
- Documentation for RDF query tools is a mess: a lot of information about using them is spread around the web, and some of it is erroneous. I wanted something simple and handy that I can easily refer to whenever I need it.
For most of this I am using the SPARQL query language.
There are many online tools to run your queries:
- DBPedia's internal SPARQL endpoint: http://dbpedia.org/sparql
- WikiData's SPARQL endpoint: https://query.wikidata.org/
- Bio2RDF's SPARQL endpoint: http://bio2rdf.org/sparql
So far YASGUI has been my favorite: a generic SPARQL editor that can query any endpoint you choose.
Prefixes help shorten queries. In other words, instead of writing full URLs, we define prefixes for them to make the query shorter. Any prefix URL/URI that does not contain a hostname is prefixed with the hostname of the generating wiki.
Here is an example URI, if used directly in the script:
<http://this.is.a/full/URI/written#out>
Instead, we define the following prefix:
PREFIX foo: <http://this.is.a/URI/prefix#>
and later in the code we do:
... foo:bar ...
where bar is a concept/page/entity/etc. defined on the target domain given by foo.
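To make the expansion concrete, here is a minimal Python sketch of what an endpoint does with a prefixed name. The `expand` helper and the `PREFIXES` table are hypothetical names made up for this illustration; they are not part of any SPARQL library.

```python
# Hypothetical prefix table, using the example prefix from the text above.
PREFIXES = {
    "foo": "http://this.is.a/URI/prefix#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Return the full URI (in angle brackets) for a name like 'foo:bar'."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    # The local part is simply appended to the prefix URI.
    return f"<{prefixes[prefix]}{local}>"


print(expand("foo:bar"))
```

So foo:bar is just shorthand for the prefix URI with bar appended to it.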
Here is the list of prefixes for DBPedia, and here is a similar list for WikiData. There is also this website for looking up important global prefix names.
Variables are indicated by a "?" or "$" prefix. For example:
?var1, ?anotherVar, ?and_one_more
You can add comments to your code by using the # prefix:
# This is a comment, ye ye, yo yo, ye ye ...
- Plain literals:
"a plain literal"
- Plain literal with language tag:
"bonjour"@fr
- Typed literal:
"13"^^xsd:integer
- Some of these typed literals have shortcuts; here are some examples:
  - true is the same as "true"^^xsd:boolean
  - 3 is the same as "3"^^xsd:integer
  - 4.2 is the same as "4.2"^^xsd:decimal
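The shortcut rules above can be sketched in a few lines of Python. This is only an illustration: `expand_shorthand` is a hypothetical helper, and it covers just the three cases listed here (booleans, integers, decimals), not every numeric form SPARQL accepts.

```python
import re


def expand_shorthand(token):
    """Map a shorthand literal to its explicit typed form (three cases only)."""
    if token in ("true", "false"):
        return f'"{token}"^^xsd:boolean'
    if re.fullmatch(r"[+-]?\d+", token):
        return f'"{token}"^^xsd:integer'
    if re.fullmatch(r"[+-]?\d*\.\d+", token):
        return f'"{token}"^^xsd:decimal'
    raise ValueError(f"not a shorthand literal handled here: {token!r}")


for token in ("true", "3", "4.2"):
    print(token, "is the same as", expand_shorthand(token))
```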
Important note: SPARQL is case sensitive (because RDF is case sensitive). For example, DBpedia uses the convention that property names start with a lower case letter (e.g. dbpedia-owl:country for "the country belonging to X is ...") and class names start with an upper case letter (e.g. dbpedia-owl:Country).
These patterns are used to select sets of triples from the RDF database
- Match an exact RDF triple:
ex:myWidget ex:partNumber "XY24Z1" .
- Match one variable:
?person foaf:name "Lee Feigenbaum" .
- Match multiple variables:
conf:SemTech2009 ?property ?value .
(picture from here)
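If it helps to see what matching a triple pattern means, here is a toy Python matcher over an in-memory list of triples. It is a sketch, not how a real triple store works: `match` is a hypothetical helper, and the data in `TRIPLES` (including ex:lee and ex:year) is made up for the example.

```python
def match(pattern, triples):
    """Yield one binding dict per triple that matches the pattern.

    Terms starting with "?" are variables; every other term must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Made-up toy data, loosely following the three patterns above.
TRIPLES = [
    ("ex:myWidget", "ex:partNumber", '"XY24Z1"'),
    ("ex:lee", "foaf:name", '"Lee Feigenbaum"'),
    ("conf:SemTech2009", "ex:year", "2009"),
]

print(list(match(("?person", "foaf:name", '"Lee Feigenbaum"'), TRIPLES)))
```

An exact triple yields a single empty binding (it matched, with nothing to bind), while each variable in the pattern adds one key to the binding.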
SELECT defines what you want and WHERE defines your conditions, restrictions, and filters.
For example:
SELECT ?subject ?predicate ?object
WHERE {?subject ?predicate ?object}
LIMIT 100
The ORDER BY clause can be used to sort the results. The GROUP BY clause can be used to group the results.
SELECT ?predicate (COUNT(*) AS ?frequency)
WHERE {?subject ?predicate ?object}
GROUP BY ?predicate
ORDER BY DESC(?frequency)
LIMIT 10
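For readers more at home in Python, the query above does roughly what the following sketch does on a list of triples: count rows per predicate, sort by descending count, and keep the top entries. The `predicate_frequencies` helper and the toy triples are assumptions made up for the illustration.

```python
from collections import Counter

# Made-up toy triples: (subject, predicate, object).
TRIPLES = [
    ("ex:paris", "rdf:type", "dbo:City"),
    ("ex:paris", "rdfs:label", '"Paris"'),
    ("ex:paris", "dbo:country", "ex:france"),
    ("ex:lyon", "rdf:type", "dbo:City"),
    ("ex:lyon", "rdfs:label", '"Lyon"'),
    ("ex:france", "rdf:type", "dbo:Country"),
]


def predicate_frequencies(triples, limit=10):
    """GROUP BY predicate, COUNT(*), ORDER BY DESC(count), LIMIT limit."""
    counts = Counter(predicate for _, predicate, _ in triples)
    return counts.most_common(limit)


print(predicate_frequencies(TRIPLES))
```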
- Conjunction operator A . B : Join together the results of solving A and B by matching the values of any variables in common.
- Left join A OPTIONAL { B } : Join together the results of solving A and B by matching the values of any variables in common, if possible. Keep all solutions from A whether or not there is a matching solution in B.
- Disjunction { A } UNION { B } : Include both the results of solving A and the results of solving B.
- Subtraction pattern A MINUS { B } : Solve A. Solve B. Include only those results from solving A that are not compatible with any of the results from B.
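The four operators are easier to compare side by side on a tiny example. Below is a hedged Python sketch in which a solution is a dict from variable name to value; the function names and the toy data are made up for the illustration, and real engines implement these operators far more efficiently.

```python
def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def join(A, B):
    """A . B : merge every compatible pair of solutions."""
    return [{**a, **b} for a in A for b in B if compatible(a, b)]


def optional(A, B):
    """A OPTIONAL { B } : like join, but keep solutions of A with no match."""
    out = []
    for a in A:
        matches = [{**a, **b} for b in B if compatible(a, b)]
        out.extend(matches or [a])
    return out


def union(A, B):
    """{ A } UNION { B } : keep the solutions of both sides."""
    return A + B


def minus(A, B):
    """A MINUS { B } : drop solutions of A that share a variable with, and
    are compatible with, some solution of B."""
    return [a for a in A
            if not any(a.keys() & b.keys() and compatible(a, b) for b in B)]


# Made-up toy solutions.
A = [{"?p": "ex:alice", "?name": "Alice"}, {"?p": "ex:bob", "?name": "Bob"}]
B = [{"?p": "ex:alice", "?mail": "alice@example.org"}]

print(join(A, B))
print(optional(A, B))
print(minus(A, B))
```

On this data, the join keeps only Alice (with her mail), the optional keeps Alice with her mail plus Bob without one, and the minus keeps only Bob.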
To get people from DBPedia:
select * { ?person a dbo:Person }
limit 100
(try here)
And getting people via WikiData:
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 }
limit 100
(try here)
You may wonder how to combine these results into one single call, i.e. call both DBPedia and WikiData at the same time. This is often referred to as "federated querying". In order to do so, we have to use the SERVICE keyword to define two endpoints:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX dbo: <http://dbpedia.org/ontology/>
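# UNION keeps the results of both endpoints; a plain join on ?person would
# match nothing, because DBPedia and WikiData use different URIs for people.
SELECT ?person WHERE {
  { SERVICE <http://dbpedia.org/sparql> { ?person a dbo:Person } }
  UNION
  { SERVICE <https://query.wikidata.org/sparql> { ?person wdt:P31 wd:Q5 } }
} LIMIT 100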
(try here)
SELECT ?item ?itemLabel
WHERE
{
?item wdt:P31 wd:Q146 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
WikiData has an entry for "Saint Louis University" and an entry for "University". Given these entities (i.e. their WikiData ids), one can ask whether one is an instanceOf the other. (See the side notes at the end of this document on how to obtain the WikiData ids.)
ASK {
wd:Q734774 wdt:P31* wd:Q3918
}
(try here)
We can modify this to query everything that is an instanceOf "University" (i.e. a list of universities).
SELECT ?thing
WHERE {
?thing wdt:P31* wd:Q3918
}
(try here)
Or vice versa, get the super-types of "Saint Louis University":
SELECT ?thing
WHERE {
wd:Q734774 wdt:P31* ?thing
}
(try here)
which are "University", "Building", "private not-for-profit educational institution".
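The * in wdt:P31* is a property path meaning "zero or more steps along this property", so the start node itself also counts as a match. A minimal Python sketch of that traversal follows; the `reachable` helper and the toy edges are made up for the illustration and are not real WikiData content.

```python
from collections import deque


def reachable(start, edges):
    """Nodes reachable from start via zero or more edges (like a p* path)."""
    seen = {start}  # zero steps: the start node itself is included
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Made-up toy edges: node -> nodes it points to via the chosen property.
EDGES = {
    "ex:slu": ["ex:university"],
    "ex:university": ["ex:organization"],
}

print(sorted(reachable("ex:slu", EDGES)))
```

Using + instead of * in the query would require at least one step, which drops the start node from the results unless a cycle leads back to it.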
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?city ?country
WHERE { ?city rdf:type dbo:City ;
rdfs:label ?label ;
dbo:country ?country
}
(try here)
Let's say you want to describe Harry Potter.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?type ?superType WHERE
{
# give me ?type of the resource
<http://dbpedia.org/resource/Harry_Potter_(character)> rdf:type ?type .
# give me ?superTypes of ?type
OPTIONAL {
?type rdfs:subClassOf ?superType .
}
}
(try here)
which would yield results like "human", "person", "fictional character", etc.
Similar to the previous examples, we find the ids for the properties "employer" and "educated at" and the ids for the entities "UIUC" and "Harvard", and use the conjunction operator ".":
SELECT ?person
WHERE {
?person wdt:P69 wd:Q13371.
?person wdt:P108 wd:Q457281
}
Now let's say you want to get the labels for each of the results:
SELECT ?person ?personLabel
WHERE {
?person wdt:P69 wd:Q13371.
?person wdt:P108 wd:Q457281
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
}
(try here)
Let's continue the example of Harvard graduates by extracting their birthplace and its coordinates. Next we can use the editor to visualize the coordinates on a map.
SELECT ?person ?personLabel ?birthPlaceLabel ?coordinates
WHERE {
?person wdt:P69 wd:Q13371.
?person wdt:P19 ?birthPlace.
?birthPlace wdt:P625 ?coordinates .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
}
(run here)
As you can see, most of the Harvard graduates are from the East Coast of the USA, while western China and central Africa have almost no representatives. Repeating the same thing for UIUC graduates would show that most UIUC graduates come from the Midwest of the USA, mostly from the Chicago suburbs.
SELECT ?personLabel ?coordinates
WHERE {
?person wdt:P39 wd:Q13218630 .
?person wdt:P172 wd:Q49085 .
?person wdt:P19 ?birthPlace.
?birthPlace wdt:P625 ?coordinates .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
}
(try here)
The SPARQL editor of WikiData can also visualize data as a timeline. Here I am visualizing the US presidents according to their date of birth. (try here)
SELECT ?ethLabel (COUNT(*) as ?count)
WHERE {
?person wdt:P39 wd:Q13218630 .
?person wdt:P172 ?eth .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
} GROUP BY ?ethLabel
This is slightly misleading: as we all know, African-Americans are not the majority overall, only among the representatives that have an "ethnicity" label. To add an extra category for those that do not have an explicit ethnicity, we can use the OPTIONAL keyword to make that pattern optional.
SELECT ?ethLabel (COUNT(*) as ?count)
WHERE {
?person wdt:P39 wd:Q13218630 .
OPTIONAL { ?person wdt:P172 ?eth }.
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
} GROUP BY ?ethLabel
which would result in 10806 representatives without an ethnicity label.
Essentially, this means finding the closest common ancestor of A and B (idea from here).
For DBPedia:
SELECT ?a ?b ?super (?aLength + ?bLength as ?length)
{
values (?a ?b) { (dbo:Person dbo:SportsTeam) }
{
SELECT ?a ?super (COUNT(?mid) as ?aLength) {
?a rdfs:subClassOf* ?mid .
?mid rdfs:subClassOf+ ?super .
}
GROUP BY ?a ?super
}
{
SELECT ?b ?super (COUNT(?mid) as ?bLength) {
?b rdfs:subClassOf* ?mid .
?mid rdfs:subClassOf+ ?super .
}
GROUP BY ?b ?super
}
}
ORDER BY ?length
LIMIT 1
(try here)
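The idea behind the query can be sketched in plain Python: compute the distance from each class to all of its ancestors, then pick the common ancestor with the smallest combined distance. This is a simplified illustration, not a translation of the query (the query counts intermediate classes and only considers strict ancestors, while this sketch counts edges and allows a class to be its own ancestor at distance 0). The helper names and the toy hierarchy are assumptions; the hierarchy is not the actual DBPedia ontology.

```python
from collections import deque


def distances(start, parents):
    """Breadth-first distances (in edges) from start to each of its ancestors."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def closest_common_ancestor(a, b, parents):
    """Return (ancestor, combined distance), or None if there is none."""
    da, db = distances(a, parents), distances(b, parents)
    common = da.keys() & db.keys()
    if not common:
        return None
    # Ties are broken by name so the result is deterministic.
    best = min(common, key=lambda c: (da[c] + db[c], c))
    return best, da[best] + db[best]


# Made-up toy hierarchy: class -> direct superclasses.
PARENTS = {
    "Person": ["Agent"],
    "SportsTeam": ["Organisation"],
    "Organisation": ["Agent"],
    "Agent": ["Thing"],
}

print(closest_common_ancestor("Person", "SportsTeam", PARENTS))
```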
For WikiData, one can try the RDF GAS API by Blazegraph:
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?super (?aLength + ?bLength as ?length) WHERE {
SERVICE gas:service {
gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.SSSP" ;
gas:in wd:Q5 ;
gas:traversalDirection "Forward" ;
gas:out ?super ;
gas:out1 ?aLength ;
gas:maxIterations 10 ;
gas:linkType wdt:P279 .
}
SERVICE gas:service {
gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.SSSP" ;
gas:in wd:Q349 ;
gas:traversalDirection "Forward" ;
gas:out ?super ;
gas:out1 ?bLength ;
gas:maxIterations 10 ;
gas:linkType wdt:P279 .
}
} ORDER BY ?length
LIMIT 1
(try here)
(note: you can query this via JSON)
There are a bunch of libraries intended for this.
But my preferred way of using the results is through the POST/GET APIs provided by many endpoints. For example, here are GET APIs that return JSON results:
- WikiData: https://query.wikidata.org/sparql?format=json&query=PUT-YOUR-QUERY-HERE (for example this).
- DBPedia: https://dbpedia.org/sparql?format=json&query=PUT-YOUR-QUERY-HERE (for example this).
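Here is a minimal sketch of calling such a GET API from Python using only the standard library. The helper names (`build_url`, `fetch`, `rows`), the User-Agent string, and the `SAMPLE` payload are assumptions made for this illustration; the payload follows the general shape of SPARQL JSON results but was written by hand, not returned by an endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_url(endpoint, query):
    """URL-encode the query and attach it with format=json."""
    return endpoint + "?" + urlencode({"format": "json", "query": query})


def fetch(endpoint, query):
    """Run the query over HTTP GET (needs network access; not called here)."""
    request = Request(build_url(endpoint, query),
                      headers={"User-Agent": "sparql-cheatsheet-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def rows(result):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [{var: cell["value"] for var, cell in binding.items()}
            for binding in result["results"]["bindings"]]


# Hand-written sample in the shape of a SPARQL JSON result.
SAMPLE = {
    "head": {"vars": ["person"]},
    "results": {"bindings": [
        {"person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q42"}},
    ]},
}

print(build_url(WIKIDATA_ENDPOINT, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"))
print(rows(SAMPLE))
```

To run a real query, call fetch(WIKIDATA_ENDPOINT, your_query) and pass the returned dict to rows.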
- You can use the Wikipedia API to map Wiki page titles to WikiData ids. For example, here is the mapping for "University", returned as JSON. Note that given a sentence/paragraph, the right way to map the constituents to their WikiData ids is to first disambiguate their Wiki pages and then use the mapping through their Wikipedia page ids. For example, consider this sentence:
The university president, John Jenkins, described his hope that Notre Dame would become "one of the pre–eminent research institutions in the world" in his inaugural address.
If I use only "Notre Dame" it would give me the id to the disambiguation page, while using the right Wikipedia page "University_of_Notre_Dame" gives me the correct id.
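As a rough sketch of that lookup in Python: the MediaWiki API can return the wikibase_item page property for a page title. The helper names are made up for this illustration, and the `SAMPLE` response below is hand-written with placeholder values (the page id and "Q00000" are not real ids), so treat it only as a picture of the response shape.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def title_lookup_url(title):
    """Build the request URL that asks for a page's WikiData id."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "titles": title,
        "format": "json",
    }
    return WIKIPEDIA_API + "?" + urlencode(params)


def wikidata_ids(response):
    """Map each returned page title to its WikiData id (None if missing)."""
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("pageprops", {}).get("wikibase_item")
            for page in pages.values()}


# Hand-written sample; the page id and "Q00000" are placeholders.
SAMPLE = {
    "query": {"pages": {"12345": {
        "pageid": 12345,
        "title": "University of Notre Dame",
        "pageprops": {"wikibase_item": "Q00000"},
    }}}
}

print(title_lookup_url("University_of_Notre_Dame"))
print(wikidata_ids(SAMPLE))
```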
- A comprehensive cheatsheet.
- General syntax guidelines.
- A list of WikiData examples.
Send a Pull-Request, or report in the issues! :)