find page from Wikidata title
lm913 opened this issue · 12 comments
Example: shoulder bag
When I use page = wptools.page('shoulder bag'), wptools returns the Wikipedia disambiguation page. However, when searching directly in Wikidata, there are multiple entries for 'shoulder bag'.
The result I am looking for is the first in the list: "shoulder bag bag with one or two long straps, designed to be suspende..."
I can't seem to find a way to get wptools to return the result I'm looking for. Does anyone have additional insight into this?
Thanks for trying wptools @lm913!
The Wikidata item you want (Q29486521) can be fetched with the wikibase argument, like this:
>>> import wptools
>>> page = wptools.page(wikibase='Q29486521')
>>> page.get()
as shown in our Usage wiki page.
However, you'll notice that the wikidata item in question (https://www.wikidata.org/wiki/Q29486521) has no Wikipedia entries.
Does that make sense?
Thank you for your quick response!
Let me describe my use case more. I have around 6000 words that I want to get the Q-IDs for and wanted to use a script to query on the word, not the Q-ID, since I only have the word as a starting point.
I suppose I could flag all the words that return a Wikipedia disambiguation page and look into them further but wanted to know if there was a way, using a word query, to identify when there are multiple results and step-through them programmatically.
I'm not sure what the presence of a Wikipedia entry indicates. Does the search favour Wikidata results that have a Wikipedia entry?
One way I can think of is to check the instance of (P31) property of the page to see whether it is a disambiguation page, and then look up the pages it links to in the wikitext:
if page.data['claims']['P31'] == [u'Q4167410']:  # Q4167410: Wikimedia disambiguation page
    for p in page.data['links']:  # [u'Handbag', u'Messenger bag', u'Satchel (bag)']
        # do something with each link here
        pass
But I don't think this approach can reach the actual item, Q29486521, which represents the shoulder bag.
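A minimal sketch of the same check, written defensively in case an item has no P31 claim at all (in which case indexing claims['P31'] directly would raise KeyError). The helper name is_disambiguation is hypothetical, and the claims dict is assumed to map property IDs to lists of values, as page.data['claims'] does:

```python
# Q4167410 is the Wikidata item for "Wikimedia disambiguation page".
DISAMBIGUATION_QID = "Q4167410"


def is_disambiguation(claims):
    """Return True if the claims dict marks the item as a disambiguation page.

    `claims` is assumed to look like page.data['claims']: a dict mapping
    property IDs to lists of values. Using .get() means a missing P31
    simply yields False instead of raising KeyError.
    """
    return DISAMBIGUATION_QID in claims.get("P31", [])
```

This could replace the direct comparison above, e.g. `if is_disambiguation(page.data['claims']):`.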
Thanks @lisongx! I looked into that earlier, but unfortunately it doesn't show the 'shoulder bag' item specifically, which does exist in Wikidata.
@lm913
I think adding search by Wikidata is not that hard, but it's probably not already in the package. @siznax Steve would know.
Maybe you could search the Wikidata API directly, and then do the rest of the work with the Wikidata ID you get from there.
For example, if you request this URL,
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=shoulder+bag&format=json&language=en&uselang=en&type=item
this returns all the entities matching the label you want:
"search": [
    {
        "repository": "",
        "id": "Q29486521",
        "concepturi": "http://www.wikidata.org/entity/Q29486521",
        "title": "Q29486521",
        "pageid": 31133085,
        "url": "//www.wikidata.org/wiki/Q29486521",
        "label": "shoulder bag",
        "description": "bag with one or two long straps, designed to be suspended from one shoulder or across the chest",
        "match": {
            "type": "label",
            "language": "en",
            "text": "shoulder bag"
        }
    },
    {
        "repository": "",
        "id": "Q7502696",
        "concepturi": "http://www.wikidata.org/entity/Q7502696",
        "title": "Q7502696",
        "pageid": 7415866,
        "url": "//www.wikidata.org/wiki/Q7502696",
        "label": "Shoulder bag",
        "description": "Wikipedia disambiguation page",
        "match": {
            "type": "label",
            "language": "en",
            "text": "Shoulder bag"
        }
    }
]
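For a list of several thousand words, the request above could be scripted. The following is only a rough sketch using the Python standard library, not anything wptools provides: the function names, the User-Agent string, and the limit default are my own assumptions, while the query parameters mirror the URL above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en", limit=10):
    """Build a wbsearchentities URL for the given search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "uselang": language,
        "type": "item",
        "format": "json",
        "limit": limit,
    }
    return API_URL + "?" + urlencode(params)


def candidates(payload):
    """Extract (id, label, description) tuples from a decoded API response."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def search_wikidata(term, language="en"):
    """Query Wikidata (network access required) and return the candidates."""
    request = Request(
        build_search_url(term, language),
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "qid-lookup-sketch/0.1 (example)"},
    )
    with urlopen(request) as response:
        return candidates(json.load(response))
```

Calling search_wikidata('shoulder bag') would then give a list to step through, so words with more than one candidate can be flagged for review.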
Side issue:
shoulder bag (desired Q-ID): Q29486521
shoulder bag (disambiguation Q-ID): Q7502696
Query code:
import wptools
page = wptools.page(wikibase='Q29486521')
page.get()
Output:
www.wikidata.org (wikidata) Q29486521
www.wikidata.org (labels) P2670|Q1973949|P1014|P373|Q1323314|P18|...
Note: Wikidata item Q29486521 missing 'instance of' (P31)
en.wikipedia.org (query) shoulder_bag
en.wikipedia.org (parse) 17611419
en.wikipedia.org (restbase) /page/summary/Shoulder bag
en.wikipedia.org (imageinfo) File:Tumi mens shoulder bag valentine.jpg
Shoulder bag (en) data
{
WARNINGS: <dict(1)> extracts
assessments: <dict(1)> Disambiguation
claims: <dict(6)> P279, P18, P373, P2670, P1014, P3832
description: <str(95)> Disambiguation page providing links to to...
disambiguation: 3
exhtml: <str(243)> <p><b>Shoulder bag</b> may refer to:</p><ul><...
exrest: <str(193)> Shoulder bag may refer to:Handbag, a bag typi...
extext: <str(209)> **Shoulder bag** may refer to: * Handbag, a ...
extract: <str(243)> <p><b>Shoulder bag</b> may refer to:</p><ul>...
image: <list(1)> {'file': 'File:Tumi mens shoulder bag valentine...
label: Shoulder bag
labels: <dict(8)> P2670, Q1973949, P1014, P373, Q1323314, P18, P...
length: 248
links: <list(3)> Handbag, Messenger bag, Satchel (bag)
modified: <dict(2)> wikidata, page
pageid: 17611419
parsetree: <str(302)> <root>'''Shoulder bag''' may refer to:* [[...
random: Niewiadoma, Masovian Voivodeship
redirects: <list(1)> {'pageid': 30089288, 'ns': 0, 'title': 'Sho...
requests: <list(6)> wikidata, labels, query, parse, restbase, im...
title: Shoulder_bag
url: https://en.wikipedia.org/wiki/Shoulder bag
url_raw: https://en.wikipedia.org/wiki/Shoulder bag?action=raw
wikibase: Q7502696
wikidata: <dict(6)> subclass of (P279), image (P18), Commons cat...
wikidata_pageid: 31133085
wikidata_url: https://www.wikidata.org/wiki/Q7502696
wikitext: <str(243)> '''Shoulder bag''' may refer to:* [[Handbag...
}
The page.results.data for querying Q-ID Q29486521 (the desired shoulder bag) returns data for the Wikipedia disambiguation page (Q7502696).
Q29486521 doesn't have any Wikipedia page related to it. I guess wptools is still trying to search Wikipedia based on the label shoulder bag somehow.
But from the above, I think the claims are correct, right?
In [9]: page.data['claims']
Out[9]:
{u'P1014': [u'300216945'],
u'P18': [u'Tumi mens shoulder bag valentine.jpg'],
u'P2670': [u'Q1973949'],
u'P279': [u'Q1323314'],
u'P373': [u'Shoulder bags'],
u'P3832': [u'10141']}
Would you be able to use this data from wikidata to accomplish your task?
(I notice that page.data['wikibase'] is different from the one we passed in. You could use that as a rule for detecting this case, but maybe we shouldn't rely on it, as it seems a bit non-intuitive. cc @siznax)
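If you do want to test for that case, a minimal sketch could look like the following. The function name is hypothetical, and it assumes only that the final item ID is stored under the 'wikibase' key, as in the output above:

```python
def resolved_elsewhere(requested_qid, data):
    """Return True if the fetched data belongs to a different item than requested.

    `data` is assumed to be shaped like page.data, with the final item ID
    stored under the 'wikibase' key.
    """
    final_qid = data.get("wikibase")
    return final_qid is not None and final_qid != requested_qid
```

For the example in this thread, resolved_elsewhere('Q29486521', page.data) would flag the result, since page.data['wikibase'] came back as Q7502696.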
Thanks for your help Sean @lisongx! It's true, we do not currently support Wikidata's wbsearchentities action, and I hope we can avoid doing that.
@lm913 I believe this result is still self-consistent. Note the page.data['redirects'] item:
[{'pageid': 30089288, 'ns': 0, 'title': 'Shoulder bag (disambiguation)'}]
This is telling us that we got redirected from our original query. The full result may be misleading because the final wikibase item (Q) is different than the input, but that is because of the redirect.
The page.get() method is a convenience method that combines other page.get_ methods to get as much info as possible. You can see which requests were made in the page.data['requests'] attribute:
>>> page.data['requests']
['wikidata', 'labels', 'query', 'parse', 'restbase', 'imageinfo']
So, we did a get_wikidata, then get_labels, etc. If the result is confusing, the first thing to do is to isolate the request actions you want to perform. If you want data for only Q29486521, then use page.get_wikidata():
>>> page = wptools.page(wikibase='Q29486521')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Q29486521
www.wikidata.org (labels) P1014|P279|P373|P18|P3832|Q1323314|Q197...
Note: Wikidata item Q29486521 missing 'instance of' (P31)
en.wikipedia.org (imageinfo) File:Tumi mens shoulder bag valentine.jpg
shoulder bag (en) data
{
claims: <dict(6)> P279, P18, P373, P2670, P1014, P3832
description: <str(95)> bag with one or two long straps, designed...
image: <list(1)> {'file': 'File:Tumi mens shoulder bag valentine...
label: shoulder bag
labels: <dict(8)> P1014, P279, P373, P18, P3832, Q1323314, Q1973...
modified: <dict(1)> wikidata
requests: <list(3)> wikidata, labels, imageinfo
title: shoulder_bag
wikibase: Q29486521
wikidata: <dict(6)> subclass of (P279), image (P18), Commons cat...
wikidata_pageid: 31133085
wikidata_url: https://www.wikidata.org/wiki/Q29486521
}
>>> page.data['wikidata']
{'AAT ID (P1014)': '300216945',
'Commons category (P373)': 'Shoulder bags',
'Europeana Fashion Vocabulary ID (P3832)': '10141',
'has parts of the class (P2670)': 'shoulder strap (Q1973949)',
'image (P18)': 'Tumi mens shoulder bag valentine.jpg',
'subclass of (P279)': 'bag (Q1323314)'}
Our wiki (it's small) explains our request actions at https://github.com/siznax/wptools/wiki/Request-actions and how we get Wikidata at https://github.com/siznax/wptools/wiki/Wikidata.
tl;dr: you should be able to get what you need by chaining together the right page.get_ calls. If not, then let us know.
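For a batch of known Q-IDs, a rough sketch of the Wikidata-only approach might look like this. It uses only the wptools.page(wikibase=...) and page.get_wikidata() calls shown above. The fetch_wikidata name and the page_factory parameter are hypothetical, added so the loop can be tried without network access:

```python
def fetch_wikidata(qids, page_factory=None):
    """Fetch Wikidata-only results for each Q-ID, keyed by Q-ID.

    `page_factory` is a hypothetical hook (not part of wptools) that lets
    the function be exercised with a stand-in page object.
    """
    if page_factory is None:
        import wptools  # imported lazily so the sketch loads without it
        page_factory = wptools.page

    results = {}
    for qid in qids:
        page = page_factory(wikibase=qid)
        # Skips the query/parse/restbase requests that page.get() would add.
        page.get_wikidata()
        results[qid] = page.data
    return results
```

Each value in the returned dict is the page.data for that item, so results['Q29486521']['claims'] would hold the claims shown earlier.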
@lm913 After reading your use case again (#145 (comment)) I realize now what the confusion is: you want Wikidata items from Wikidata titles, but wptools does not search Wikidata, it only fetches Wikidata with an item id (Q). Searching Wikidata for the right Q item can be done directly from the Wikidata API:
https://www.wikidata.org/w/api.php?action=wbsearchentities&language=en&search=shoulder%20bag
I don't think wptools can add any value by simply showing you that result. However, once you find the Wikidata item (Q) you want, then you can use wptools to get the Wikidata you want into a python object.
On the other hand, it may be useful to show search results for a title from both Wikipedia and Wikidata. What would a useful search command look like? Maybe...
>>> import wptools
>>> wptools.search('shoulder bag')
{
"wikidata": <Wikidata wbsearchentities result>,
"wikipedia": <Wikipedia opensearch result>
}
Wikidata result:
https://www.wikidata.org/w/api.php?action=wbsearchentities&language=en&search=shoulder%20bag
Wikipedia result:
https://en.wikipedia.org/w/api.php?action=opensearch&search=shoulder%20bag
We'd probably want to transform the results so they are consistent and minimal. Maybe...
{
    "wikidata": [{id, label, description}, ...],
    "wikipedia": [title, ...]
}
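As a rough illustration of that transformation (not an existing wptools function), the two raw responses could be reduced like this. It assumes the wbsearchentities response carries a "search" list and that opensearch returns its usual JSON array of [query, [titles], [descriptions], [urls]]:

```python
def normalize_results(wikidata_payload, opensearch_payload):
    """Reduce both raw API responses to the minimal shape sketched above."""
    wikidata = [
        {
            "id": hit["id"],
            "label": hit.get("label"),
            "description": hit.get("description"),
        }
        for hit in wikidata_payload.get("search", [])
    ]
    # opensearch returns [query, [titles], [descriptions], [urls]];
    # only the titles are kept.
    titles = list(opensearch_payload[1]) if len(opensearch_payload) > 1 else []
    return {"wikidata": wikidata, "wikipedia": titles}
```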
@lm913 would that help solve your problem?
@siznax @lisongx @lm913 I just discovered wptools and was looking for the same functionality.
I'd like to easily search Wikidata for some text substring, e.g. wptools.search(substring='Michael Jordan').
And I'd like to get back a list of items with their Q-IDs, similar to what @siznax proposed (though I am mostly interested in Wikidata, not in Wikipedia).
I would suggest adding some optional parameters to wptools.search():
- a list of what to search (i.e. 'wikidata', 'wikipedia', ...). If empty, then search everything (or whatever default you prefer).
- a dictionary of multiple Property:Value conditions that items should match to be included in the returned list:
  i.e. wptools.search(substring="Michael", conditions={'P31':'Q5', 'P106':'Q3665646'})
  (search all humans that worked as basketball players, containing 'Michael' in their wikidata labels)
- a boolean details (default False), so that each item in the returned list contains also the full list of conditions it meets:
  wptools.search(substring='Michael', search=['wikidata'], conditions={'P31':'Q5', 'P106':'Q3665646'}, details=True)
That should return details (instance of human, sex, citizenship, marriages ...) for each 'Michael-labelled' basketball player:
{
"wikidata": [
{'id':'Q41421', 'label':'Michael Jordan', 'description':'American basketball player and businessman (born 1963)',
'details':{'P31':'Q5', 'P21':'Q6581097', 'P27':'Q30', 'P26':['Q26220952','Q26220958'], ...} },
{...},
...
],
}
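To make the conditions idea concrete, here is a small sketch of how the matching could work on the client side once each item's claims are known. This is only an illustration of the proposal, not an existing wptools feature. The function names are hypothetical, and claims are assumed to map property IDs to lists of values, as page.data['claims'] does:

```python
def matches_conditions(claims, conditions):
    """Return True if every Property:Value condition is present in claims.

    `claims` maps property IDs to lists of values; `conditions` maps
    property IDs to a single required value.
    """
    return all(
        value in claims.get(prop, [])
        for prop, value in conditions.items()
    )


def filter_items(items, conditions):
    """Keep only the items whose 'claims' satisfy all conditions."""
    return [
        item for item in items
        if matches_conditions(item.get("claims", {}), conditions)
    ]
```

With conditions={'P31': 'Q5', 'P106': 'Q3665646'}, an item is kept only if both values appear in its claims.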
Not sure about how to deal with multiple languages. Maybe there should be also a parameter for that.
Sorry, I am by no means an expert on wikidata internals so I don't know if all this makes sense.
I don't even know whether the substring search should be made on labels or titles (I don't quite understand the difference between them, if any).
Regards
@abubelinha