kermitt2/entity-fishing

General Statistics of Retrievable Wiki Entities

ndenStanford opened this issue · 4 comments

Would you mind letting me know where I can see the general statistics of the retrievable wiki entities? I would love to understand the size of the retrievable knowledge base broken down into each entity category. If such statistics are not available, can you please elaborate on how I can read from the database directly? I can see that the keys and values in the database are encoded, and I am not able to decode them since I do not know how they were encoded.

Appreciate your attention to this matter :)

Hello !

When the service starts, the console prints some basic statistics on the loaded data. For instance, with the currently provided KB files:

building Environment for upper knowledge-base
Environment built - 100009060 concepts.
                  - 18271696 loaded statements.

The upper knowledge base is Wikidata, so this indicates the number of Wikidata entities loaded (= concepts) and the number of Wikidata statements. By default, for a given entity, we only load its statements if the entity exists in at least one language-specific Wikipedia; there is a configuration parameter to change this and load all the statements:

restrictConceptStatementsToWikipediaPages: true

This is based on a Wikidata dump from December 2022.

Then, we have the basic Wikipedia loading information for every installed/supported language:

init Environment for language en
building Environment for language en
conceptByPageIdDatabase / isLoaded: true
Environment built - 18285639 pages.
init Environment for language de
building Environment for language de
conceptByPageIdDatabase / isLoaded: true
Environment built - 4651613 pages.
init Environment for language fr
building Environment for language fr
conceptByPageIdDatabase / isLoaded: true
Environment built - 4595403 pages.
init Environment for language es
building Environment for language es
conceptByPageIdDatabase / isLoaded: true
Environment built - 4083089 pages.
init Environment for language it
building Environment for language it
conceptByPageIdDatabase / isLoaded: true
Environment built - 3100632 pages.
init Environment for language ar
building Environment for language ar
conceptByPageIdDatabase / isLoaded: true
Environment built - 3122743 pages.
init Environment for language zh
building Environment for language zh
conceptByPageIdDatabase / isLoaded: true
Environment built - 2940719 pages.
init Environment for language ja
building Environment for language ja
conceptByPageIdDatabase / isLoaded: true
Environment built - 2442126 pages.
init Environment for language ru
building Environment for language ru
conceptByPageIdDatabase / isLoaded: true
Environment built - 4996639 pages.
init Environment for language pt
building Environment for language pt
conceptByPageIdDatabase / isLoaded: true
Environment built - 2232943 pages.
init Environment for language fa
building Environment for language fa
conceptByPageIdDatabase / isLoaded: true
Environment built - 3431204 pages.
init Environment for language sv
building Environment for language sv
conceptByPageIdDatabase / isLoaded: true
Environment built - 4786878 pages.
init Environment for language uk
building Environment for language uk
conceptByPageIdDatabase / isLoaded: true
Environment built - 2379489 pages.
init Environment for language bn
building Environment for language bn
conceptByPageIdDatabase / isLoaded: true
Environment built - 459096 pages.
init Environment for language hi
building Environment for language hi
conceptByPageIdDatabase / isLoaded: true
Environment built - 257865 pages.

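If you just need totals, you can also sum these lines directly from the console output. A minimal sketch in Python, assuming the console output has been saved to a file (the file name is only a placeholder):

import re

# Sum the per-language page counts from the entity-fishing startup log.
# "entity-fishing.log" stands for wherever the console output was
# captured; the line format follows the output pasted above.
pattern = re.compile(r"Environment built - (\d+) pages\.")

total_pages = 0
with open("entity-fishing.log", encoding="utf-8") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            total_pages += int(match.group(1))

print(f"{total_pages} Wikipedia pages loaded across all languages")

Summing the fifteen per-language counts above this way gives roughly 62M pages in total.
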
Parsing and compilation of all the Wikidata and Wikipedia resources is done by https://github.com/kermitt2/grisp

I should certainly output more statistics on everything loaded and create a web service to get all the statistics of the current KB used by a running service.

About the distribution of entities according to categories, I am not sure what you mean by categories, but you will find statistics on the Wikidata web site.

Much appreciate the reply. I have a couple of follow-up questions. Does that mean all the fetched Wikipedia pages (around 62M in total, from summing up all the page counts) and the 18271696 Wikidata entities can be linked to mentions in the text, while the rest of the Wikidata ids cannot? It's also quite a surprise to see how small the number of loaded statements is compared to the number of concepts, so I wonder why that is.

@ndenStanford The current state of the tool is as follows:

  • for a given language, only a Wikidata entity having a corresponding "page" in that language's Wikipedia can be "linked" in context, because disambiguation requires some textual usage context (anchors), synonyms (redirections) and links (see the sketch after this list)
  • so the proportion of Wikidata entities that can be disambiguated in context depends on the language and its Wikipedia
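
For example, something along these lines against a running instance shows which mentions get linked with both a Wikidata id and a Wikipedia page id (a rough sketch: the default local port 8090 and the /service/disambiguate endpoint are assumed, and the response field names may differ depending on the version):

import json
import requests

# Disambiguate a short text against a local entity-fishing instance
# (default port 8090 assumed) and print, for each linked mention, the
# Wikidata id and the page id in the query language's Wikipedia.
query = {
    "text": "Austria invaded and fought the Serbian army at the Battle of Cer.",
    "language": {"lang": "en"},
}

resp = requests.post(
    "http://localhost:8090/service/disambiguate",
    files={"query": (None, json.dumps(query))},
)
resp.raise_for_status()

for entity in resp.json().get("entities", []):
    print(
        entity.get("rawName"),
        entity.get("wikidataId"),
        entity.get("wikipediaExternalRef"),  # page id in the query language's Wikipedia
    )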

That's why it is currently hard to support languages with fewer than one million Wikipedia pages: without enough training material in the language (pages and page interlinking), we don't have enough data to disambiguate terms.

But I think it's still useful to have access to the whole of Wikidata (the 100M entities) via the KB API, because, for example, I have other tools doing more specialized entity mention extraction, and I exploit and link these other entities directly. So the 100M entities are loaded.
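
Any loaded entity can be retrieved by its Wikidata identifier via the KB API of a running instance. A small sketch, assuming the default local port 8090 and the /service/kb/concept endpoint (the response field names are indicative):

import requests

# Fetch a Wikidata entity directly from the knowledge base of a running
# entity-fishing instance, independently of any text disambiguation.
# Q84 (London) is used purely as an example identifier.
resp = requests.get(
    "http://localhost:8090/service/kb/concept/Q84",
    params={"lang": "en"},  # language used for the returned labels/definitions
)
resp.raise_for_status()

concept = resp.json()
# The number of statements returned depends on the
# restrictConceptStatementsToWikipediaPages setting discussed above.
print(concept.get("rawName"), "-", len(concept.get("statements", [])), "statements")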

It's also quite a surprise to see how small the number of loaded statements is compared to the number of concepts, so I wonder why that is.

As I wrote, by default we only load the statements of an entity if this entity "exists" in at least one language-specific Wikipedia. It appears that usually the same subset of entities is present across the different Wikipedias, and not every page in a Wikipedia corresponds to a Wikidata entity (many Wikipedia pages are redirections, disambiguation pages, categories, etc.; only articles are mapped to a Wikidata entity, for example around 6M for English, out of 18M pages).

It's possible to load all the statements by changing the config (setting restrictConceptStatementsToWikipediaPages to false). For example, I have an older version of entity-fishing with all statements loaded, and it's more than 1B statements. Due to the size of the resulting DB and the extra indexing time, it's turned off by default.

Thank you very much for the clarification :)