siznax/wptools

find page from Wikidata title

lm913 opened this issue · 12 comments

lm913 commented

Example: shoulder bag

When I use page = wptools.page('shoulder bag'), wptools returns the Wikipedia disambiguation page. However, searching directly in Wikidata shows multiple entries for 'shoulder bag'.


The result I am looking for is the first in the list: "shoulder bag bag with one or two long straps, designed to be suspende..."

I can't seem to find a way to get wptools to return the result I'm looking for. Does anyone have additional insight into this?

Thanks for trying wptools @lm913!

The wikidata item you want (Q29486521) can be got with wikibase like this:

>>> import wptools
>>> page = wptools.page(wikibase='Q29486521')
>>> page.get()

as shown in our Usage wiki page.

However, you'll notice that the wikidata item in question (https://www.wikidata.org/wiki/Q29486521) has no Wikipedia entries.

Does that make sense?

lm913 commented

Thank you for your quick response!

Let me describe my use case more. I have around 6000 words that I want to get the Q-IDs for and wanted to use a script to query on the word, not the Q-ID, since I only have the word as a starting point.

I suppose I could flag all the words that return a Wikipedia disambiguation page and look into them further but wanted to know if there was a way, using a word query, to identify when there are multiple results and step-through them programmatically.

I'm not sure what the presence of a Wikipedia entry indicates. Does the search favour Wikidata results that have a Wikipedia entry?

One way I can think of is to check the "instance of" property (P31) in the page to see whether it is a disambiguation page, and then look up the pages linked in the wikitext.

# Q4167410 is the Wikidata item "Wikimedia disambiguation page"
if page.data['claims'].get('P31') == [u'Q4167410']:
    for p in page.data['links']:  # [u'Handbag', u'Messenger bag', u'Satchel (bag)']
        print(p)  # do something with each link here

But I don't think this approach can reach the actual item, Q29486521, that represents the shoulder bag.
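A slightly more defensive version of that check could look like the sketch below. It assumes page.data has the shape shown in this thread (a 'claims' dict keyed by property ID and a 'links' list). The helper name disambiguation_links is hypothetical and not part of wptools.

```python
# Q4167410 is the Wikidata item for "Wikimedia disambiguation page".
DISAMBIGUATION_QID = 'Q4167410'


def disambiguation_links(data):
    """Return the linked titles if `data` describes a disambiguation page.

    `data` is assumed to look like wptools' page.data: a dict with an
    optional 'claims' dict (property ID -> list of values) and an optional
    'links' list. Returns an empty list for any other page.
    """
    instance_of = data.get('claims', {}).get('P31', [])
    if DISAMBIGUATION_QID in instance_of:
        return list(data.get('links', []))
    return []
```

Calling disambiguation_links(page.data) after page.get() would then give the titles to follow up on, or an empty list when the page is not flagged as a disambiguation page.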

lm913 commented

Thanks @lisongx. I looked into that earlier, but unfortunately it doesn't show 'shoulder bag' specifically, which does exist in Wikidata.

@lm913
I think adding search by Wikidata is not that hard, but it may not already be in the package. @siznax Steve would know.

Maybe you could try searching the Wikimedia API for Wikidata directly, and then do the rest with the Wikidata ID you get from there.

For example, if you request this URL,

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=shoulder+bag&format=json&language=en&uselang=en&type=item

this would return all the entities with the label you want:

  "search": [
    {
      "repository": "",
      "id": "Q29486521",
      "concepturi": "http://www.wikidata.org/entity/Q29486521",
      "title": "Q29486521",
      "pageid": 31133085,
      "url": "//www.wikidata.org/wiki/Q29486521",
      "label": "shoulder bag",
      "description": "bag with one or two long straps, designed to be suspended from one shoulder or across the chest",
      "match": {
        "type": "label",
        "language": "en",
        "text": "shoulder bag"
      }
    },
    {
      "repository": "",
      "id": "Q7502696",
      "concepturi": "http://www.wikidata.org/entity/Q7502696",
      "title": "Q7502696",
      "pageid": 7415866,
      "url": "//www.wikidata.org/wiki/Q7502696",
      "label": "Shoulder bag",
      "description": "Wikipedia disambiguation page",
      "match": {
        "type": "label",
        "language": "en",
        "text": "Shoulder bag"
      }
    }]
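
The same request can be made from a script. The sketch below uses only the Python standard library and the parameters from the URL above. The function names and the User-Agent string are placeholders chosen for illustration, and filtering on the description text is a heuristic based on the sample response, not a documented contract.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = 'https://www.wikidata.org/w/api.php'


def build_search_url(term, language='en'):
    """Build a wbsearchentities URL with the parameters used above."""
    params = {
        'action': 'wbsearchentities',
        'search': term,
        'format': 'json',
        'language': language,
        'uselang': language,
        'type': 'item',
    }
    return API_URL + '?' + urlencode(params)


def search_entities(term, language='en'):
    """Fetch the 'search' list for a term. This performs a network call."""
    # Placeholder User-Agent; replace it with one identifying your script.
    request = Request(build_search_url(term, language),
                      headers={'User-Agent': 'qid-lookup-example/0.1'})
    with urlopen(request) as response:
        return json.load(response).get('search', [])


def drop_disambiguation(results):
    """Skip entries whose description marks a disambiguation page."""
    return [r for r in results
            if r.get('description') != 'Wikipedia disambiguation page']
```

For the response above, drop_disambiguation would keep Q29486521 and discard Q7502696.
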
lm913 commented

This is a great solution @lisongx ! Thank you so much :)

lm913 commented

Side issue:

shoulder bag (desired Q-ID): Q29486521
shoulder bag (disambiguation Q-ID): Q7502696

Query code:

import wptools
page = wptools.page(wikibase='Q29486521')
page.get()

Output:

www.wikidata.org (wikidata) Q29486521
www.wikidata.org (labels) P2670|Q1973949|P1014|P373|Q1323314|P18|...
Note: Wikidata item Q29486521 missing 'instance of' (P31)
en.wikipedia.org (query) shoulder_bag
en.wikipedia.org (parse) 17611419
en.wikipedia.org (restbase) /page/summary/Shoulder bag
en.wikipedia.org (imageinfo) File:Tumi mens shoulder bag valentine.jpg
Shoulder bag (en) data
{
  WARNINGS: <dict(1)> extracts
  assessments: <dict(1)> Disambiguation
  claims: <dict(6)> P279, P18, P373, P2670, P1014, P3832
  description: <str(95)> Disambiguation page providing links to to...
  disambiguation: 3
  exhtml: <str(243)> <p><b>Shoulder bag</b> may refer to:</p><ul><...
  exrest: <str(193)> Shoulder bag may refer to:Handbag, a bag typi...
  extext: <str(209)> **Shoulder bag** may refer to:  * Handbag, a ...
  extract: <str(243)> <p><b>Shoulder bag</b> may refer to:</p><ul>...
  image: <list(1)> {'file': 'File:Tumi mens shoulder bag valentine...
  label: Shoulder bag
  labels: <dict(8)> P2670, Q1973949, P1014, P373, Q1323314, P18, P...
  length: 248
  links: <list(3)> Handbag, Messenger bag, Satchel (bag)
  modified: <dict(2)> wikidata, page
  pageid: 17611419
  parsetree: <str(302)> <root>'''Shoulder bag''' may refer to:* [[...
  random: Niewiadoma, Masovian Voivodeship
  redirects: <list(1)> {'pageid': 30089288, 'ns': 0, 'title': 'Sho...
  requests: <list(6)> wikidata, labels, query, parse, restbase, im...
  title: Shoulder_bag
  url: https://en.wikipedia.org/wiki/Shoulder bag
  url_raw: https://en.wikipedia.org/wiki/Shoulder bag?action=raw
  wikibase: Q7502696
  wikidata: <dict(6)> subclass of (P279), image (P18), Commons cat...
  wikidata_pageid: 31133085
  wikidata_url: https://www.wikidata.org/wiki/Q7502696
  wikitext: <str(243)> '''Shoulder bag''' may refer to:* [[Handbag...
}

The page.results.data for querying Q-ID Q29486521 (the desired shoulder bag) returns data for the Wikipedia disambiguation page (Q7502696).

@lm913

Q29486521 doesn't have any Wikipedia page related to it. I guess wptools is still trying to search Wikipedia based on the label "shoulder bag" somehow.

But from the output above, I think the claims are correct, right?

In [9]: page.data['claims']
Out[9]:
{u'P1014': [u'300216945'],
 u'P18': [u'Tumi mens shoulder bag valentine.jpg'],
 u'P2670': [u'Q1973949'],
 u'P279': [u'Q1323314'],
 u'P373': [u'Shoulder bags'],
 u'P3832': [u'10141']}

Would you be able to use this data from wikidata to accomplish your task?
(I notice page.data['wikibase'] is different from the one we passed in. You could use that as a rule for detecting this case, but maybe wptools shouldn't behave this way, as it seems a bit non-intuitive. cc @siznax)

Thanks for your help Sean @lisongx ! It's true, we do not currently support Wikidata's wbsearchentities action, and I hope we can avoid doing that.

@lm913 I believe this result is still self-consistent. Note the page.data['redirects'] item:

[{'pageid': 30089288, 'ns': 0, 'title': 'Shoulder bag (disambiguation)'}]

This is telling us that we got redirected from our original query. The full result may be misleading because the final wikibase item (Q) is different than the input, but that is because of the redirect.
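
One way to spot this case in a script is to compare the Q-ID that was requested with the one in the result. This is a sketch only: it assumes page.data contains the 'wikibase' and 'redirects' keys shown in the output above, and the helper name result_report is hypothetical.

```python
def result_report(requested_qid, data):
    """Summarize whether `data` still describes the requested item.

    `data` is assumed to look like wptools' page.data, with an optional
    'wikibase' string and an optional 'redirects' list of dicts.
    """
    final_qid = data.get('wikibase')
    return {
        'requested': requested_qid,
        'final': final_qid,
        # True when the result describes a different item than requested
        'changed': final_qid != requested_qid,
        # Titles listed under page.data['redirects'], if any
        'redirect_titles': [r.get('title') for r in data.get('redirects', [])],
    }
```

A result with 'changed' set to True can then be set aside for review instead of being trusted as data for the requested item.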

The page.get() method is a convenience method for a mix of other page.get_ methods that might get us as much info as possible. You can see which requests were made in the page.data['requests'] attribute:

>>> page.data['requests']
['wikidata', 'labels', 'query', 'parse', 'restbase', 'imageinfo']

So, we did a get_wikidata, then get_labels, etc. If the result is confusing, the first thing to do is to isolate the request actions you want to perform. If you want data for only Q29486521, then use page.get_wikidata():

>>> page = wptools.page(wikibase='Q29486521')

>>> page.get_wikidata()
www.wikidata.org (wikidata) Q29486521
www.wikidata.org (labels) P1014|P279|P373|P18|P3832|Q1323314|Q197...
Note: Wikidata item Q29486521 missing 'instance of' (P31)
en.wikipedia.org (imageinfo) File:Tumi mens shoulder bag valentine.jpg
shoulder bag (en) data
{
  claims: <dict(6)> P279, P18, P373, P2670, P1014, P3832
  description: <str(95)> bag with one or two long straps, designed...
  image: <list(1)> {'file': 'File:Tumi mens shoulder bag valentine...
  label: shoulder bag
  labels: <dict(8)> P1014, P279, P373, P18, P3832, Q1323314, Q1973...
  modified: <dict(1)> wikidata
  requests: <list(3)> wikidata, labels, imageinfo
  title: shoulder_bag
  wikibase: Q29486521
  wikidata: <dict(6)> subclass of (P279), image (P18), Commons cat...
  wikidata_pageid: 31133085
  wikidata_url: https://www.wikidata.org/wiki/Q29486521
}

>>> page.data['wikidata']
{'AAT ID (P1014)': '300216945',
 'Commons category (P373)': 'Shoulder bags',
 'Europeana Fashion Vocabulary ID (P3832)': '10141',
 'has parts of the class (P2670)': 'shoulder strap (Q1973949)',
 'image (P18)': 'Tumi mens shoulder bag valentine.jpg',
 'subclass of (P279)': 'bag (Q1323314)'}
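
If you need the bare IDs back out of these labelled strings, a small parser is enough. The sketch below assumes the "label (ID)" format shown in the output above. Both helper names are hypothetical, and page.data['claims'] already holds the raw IDs if that is all you need.

```python
import re

# Matches strings such as 'subclass of (P279)' or 'bag (Q1323314)'.
_LABELLED = re.compile(r'^(?P<label>.*) \((?P<id>[PQ]\d+)\)$')


def split_label(text):
    """Split 'bag (Q1323314)' into ('bag', 'Q1323314').

    Returns (text, None) when the string does not end in a P or Q ID.
    """
    match = _LABELLED.match(text)
    if match:
        return match.group('label'), match.group('id')
    return text, None


def linked_items(wikidata):
    """Map property IDs to the Q-IDs referenced by their values."""
    result = {}
    for key, value in wikidata.items():
        _, pid = split_label(key)
        values = value if isinstance(value, list) else [value]
        qids = [split_label(v)[1] for v in values if isinstance(v, str)]
        qids = [q for q in qids if q and q.startswith('Q')]
        if pid and qids:
            result[pid] = qids
    return result
```
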

Our (small) wiki explains our request actions at https://github.com/siznax/wptools/wiki/Request-actions and how we get Wikidata at https://github.com/siznax/wptools/wiki/Wikidata.

tl;dr: you should be able to get what you need by chaining together the right page.get_ calls. If not, then let us know.

@lm913 After reading your use case again (#145 (comment)) I realize now what the confusion is: you want Wikidata items from Wikidata titles, but wptools does not search Wikidata, it only fetches Wikidata with an item id (Q). Searching Wikidata for the right Q item can be done directly from the Wikidata API:

https://www.wikidata.org/w/api.php?action=wbsearchentities&language=en&search=shoulder%20bag

I don't think wptools can add any value by simply showing you that result. However, once you find the Wikidata item (Q) you want, then you can use wptools to get the Wikidata you want into a python object.
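
For a long word list, the lookup step could be sketched as below. The search argument is an assumption: any callable that takes a term and returns a list of dicts with an 'id' key, like the 'search' list in a wbsearchentities response. The function name resolve_terms is hypothetical and not part of wptools.

```python
def resolve_terms(terms, search):
    """Split terms into uniquely resolved and ambiguous ones.

    `search(term)` is assumed to return a list of dicts that each have an
    'id' key. Returns (resolved, ambiguous): `resolved` maps a term to its
    single Q-ID, `ambiguous` maps a term to its list of candidate Q-IDs.
    """
    resolved, ambiguous = {}, {}
    for term in terms:
        qids = [hit['id'] for hit in search(term)]
        if len(qids) == 1:
            resolved[term] = qids[0]
        else:
            # Zero or several candidates: keep them for manual review.
            ambiguous[term] = qids
    return resolved, ambiguous
```

Each resolved Q-ID can then be passed to wptools.page(wikibase=...) and fetched with get_wikidata(), as in the earlier example.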

On the other hand, it may be useful to show search results for a title from both Wikipedia and Wikidata. What would a useful search command look like? Maybe...

>>> import wptools
>>> wptools.search('shoulder bag')
{
    "wikidata": <Wikidata wbsearchentities result>,
    "wikipedia": <Wikipedia opensearch result>
}

Wikidata result:
https://www.wikidata.org/w/api.php?action=wbsearchentities&language=en&search=shoulder%20bag

Wikipedia result:
https://en.wikipedia.org/w/api.php?action=opensearch&search=shoulder%20bag

We'd probably want to transform the results so they are consistent and minimal. Maybe...

{
  "wikidata": [{id, label, description}, ...],
  "wikipedia": [title, ...]
}

@lm913 would that help solve your problem?
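
The transformation could be sketched as follows. This is not an existing wptools function: the name minimal_search is hypothetical, and the code assumes the two responses have already been fetched and decoded from JSON. It relies on the opensearch response being a list whose second element holds the matching titles.

```python
def minimal_search(wikidata_response, opensearch_response):
    """Reduce both search responses to the minimal shape proposed above.

    `wikidata_response` is a decoded wbsearchentities response (a dict with
    a 'search' list). `opensearch_response` is a decoded opensearch
    response: a list of the form [query, [titles], ...].
    """
    wikidata = [
        {
            'id': hit.get('id'),
            'label': hit.get('label'),
            'description': hit.get('description'),
        }
        for hit in wikidata_response.get('search', [])
    ]
    if len(opensearch_response) > 1:
        titles = list(opensearch_response[1])
    else:
        titles = []
    return {'wikidata': wikidata, 'wikipedia': titles}
```
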

@siznax @lisongx @lm913 I just discovered wptools and was looking for the same functionality.

I'd like to easily search Wikidata for some text substring, e.g. wptools.search(substring='Michael Jordan').
And I'd like to get back a list of items with their Q-IDs, similar to what @siznax proposed (though I am mostly interested in Wikidata, not in Wikipedia).

I would suggest adding some optional parameters to wptools.search():

  • a list of what to search (i.e. 'wikidata', 'wikipedia', ...). If empty, then search everything (or whatever default you prefer).

  • a dictionary of multiple Property:Value conditions that items should match to be included in the returned list:
    i.e. wptools.search(substring="Michael", conditions={'P31':'Q5', 'P106':'Q3665646'})
    (search all humans that worked as basketball players, containing 'Michael' in their wikidata labels)

  • a boolean details (default False), so that each item in the returned list contains also the full list of conditions it meets:
    wptools.search(substring='Michael', search=['wikidata'], conditions={'P31':'Q5', 'P106':'Q3665646'}, details=True)

    That should return details (instance of human, sex, citizenship, marriages ...) for each 'Michael-labelled' basketball player:

  {
    "wikidata": [
       {'id':'Q41421', 'label':'Michael Jordan', 'description':'American basketball player and businessman (born 1963)', 
         'details':{'P31':'Q5', 'P21':'Q6581097', 'P27':'Q30', 'P26':['Q26220952','Q26220958'], ...} }, 
       {...},
        ...
     ],
  }
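
The conditions check itself is simple once each item's claims are available. The sketch below assumes claims in the shape of page.data['claims'] shown earlier in this thread (property ID mapped to a list of values). Both function names are hypothetical, and the proposed wptools.search() does not exist.

```python
def meets_conditions(claims, conditions):
    """Return True if every Property:Value pair in `conditions` is claimed.

    `claims` maps property IDs to a value or a list of values.
    `conditions` maps property IDs to the single value required.
    """
    for prop, wanted in conditions.items():
        values = claims.get(prop, [])
        if not isinstance(values, list):
            values = [values]
        if wanted not in values:
            return False
    return True


def filter_items(items, conditions):
    """Keep only the items whose 'claims' satisfy all conditions."""
    return [item for item in items
            if meets_conditions(item.get('claims', {}), conditions)]
```

With conditions={'P31': 'Q5', 'P106': 'Q3665646'}, only items claimed to be both human and a basketball player would be kept.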

I'm not sure how to deal with multiple languages. Maybe there should also be a parameter for that.
Sorry, I am by no means an expert on Wikidata internals, so I don't know if all this makes sense.
I don't even know whether the substring search should be made on labels or titles (I don't quite understand the difference between them, if any).

Regards
@abubelinha