Code for pre-processing wikidata json dump?
Could you please also share the code that you use for pre-processing the wikidata json dump? This would be an enormous help. Thanks!
The relevant code you are referring to involves multiple stages of entity and relation filtering, spread across dozens of scripts. Unfortunately, we can't share it as it is not properly documented, and the original dump is not required to work with the dataset in any case.
You could take a look at https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON, read the original dump line by line, and extract just the property info.
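For instance, a minimal sketch of such a pass (this is not our preprocessing code, just an illustration of the data model linked above; the helper names and dump path are placeholders) could look like this:

```python
import json
import sys


def iter_entities(dump_path):
    """Yield one entity dict per line of the decompressed Wikidata JSON dump.

    The dump is a single JSON array with one entity per line, so each line
    (minus its trailing comma) can be parsed independently.
    """
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):  # skip the enclosing array brackets
                continue
            yield json.loads(line)


def extract_item_triples(entity):
    """Yield (subject, property, object) triples for item-valued claims,
    e.g. P31 ("instance of"), which gives a rough notion of entity type."""
    subject = entity.get("id")
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue
            value = snak.get("datavalue", {}).get("value")
            if isinstance(value, dict) and value.get("entity-type") == "item":
                obj = value.get("id", "Q{}".format(value.get("numeric-id")))
                yield subject, prop, obj


if __name__ == "__main__":
    # Usage: python extract_properties.py <path to decompressed dump>
    for entity in iter_entities(sys.argv[1]):
        for s, p, o in extract_item_triples(entity):
            print(s, p, o)
```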
Hey @vardaan123 , I can totally understand. The reason I wanted the pre-processing script is that I am having trouble understanding the pre-processed files. Specifically:

- I could not find `wikidata_short_1.json` and `wikidata_short_2.json` after extracting the zip file as mentioned on the download page. I believe these are the files that contain the actual KB triples? I did, however, find `comp_wikidata_rev.json`, which, I believe, contains the triples but in the reverse order. I could possibly extract the real triples by reversing the triples from `comp_wikidata_rev.json`, but I was wondering if this was done on purpose?
- In your paper, you mention shortlisting 330 relations. However, on parsing `comp_wikidata_rev.json` I found 357 unique relation ids. Why this difference?
- From my understanding, it seems that `wikidata_type_dict.json` contains the projection of the actual triples onto the entity types. But I found just 335 unique relation ids, which is less than the 357 of `comp_wikidata_rev.json` (see point 2). What is the reason for this?
- Between `items_wikidata_n.json` and `child_par_dict_name_2_corr.json` I see that some labels have been modified. For example, `administrative territorial entity of Cyprus` became `administrative territory of Cyprus`, `administrative territorial entity of the United States` became `US administrative territory`, and `types of tennis match` became `tennis match`. What is the reason for this?
- Is the type of an entity an explicit attribute in the Wikidata dump, or is it derived from the hierarchy? (Sorry, I am not very familiar with the Wikidata schema.) I ask this because I did not understand the following statement in your paper:

  > Similarly, of the 30.8K unique entity types in wikidata, we selected 642 types (considering only immediate parents of entities) which appeared in the top 90 percentile of the tuples associated with atleast one of the retained meaningful relations

  By "considering only immediate parents of entities" do you mean that the number of distinct types of parent entities is 642, or do you mean there is some connection between the type of an entity and its parent? If it's the latter, what connection do they have?
- Going by `wikidata_type_dict.json`, it seems as though it has 2495 entity types instead of just the 642 types mentioned in the paper.

It would be great if you could help me understand the reasons for the above. I have a few more doubts, but I am hoping they will get cleared once I understand the reasons for the above points.
Thanks,
Hey, the zip file is only for the dialogs. The wikidata jsons are shared in a separate Google Drive folder (https://drive.google.com/drive/folders/1ITcgvp4vZo1Wlb66d_SnHvVmLKIqqYbR?usp=sharing), which is linked on the website. All the required wikidata jsons are in that folder.
- To extract the forward triples, you just need a concatenation of `wikidata_short_1.json` and `wikidata_short_2.json` (a rough sketch is given after this list).
- The relations/entities in some of these jsons might be a super-set of what is actually used while instantiating the templates. It is kind of troublesome to update the jsons every time we discard some relations.
- See point 2.
- We reduced the verbosity of some entity names based on feedback received from a set of researchers who tried to use this dataset.
- The type info is not explicitly encoded in wikidata. We consider 642 entity types because they cover the top 90 percentile of the tuples. There are some properties like "instance_of" through which you can get an idea of the type.
- See above.
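In case it helps with point 1, here is a rough sketch of the concatenation (and of recovering forward triples from `comp_wikidata_rev.json`). It assumes the files are dicts of the form `{subject: {relation: [objects]}}`, so do double-check that against your copies:

```python
import json

# Sketch only: it assumes both shards (and the reverse file) are dicts of the
# form {entity_id: {relation_id: [entity_id, ...]}}; verify this against your
# downloaded copies before relying on it.


def load_forward_triples(path_1="wikidata_short_1.json",
                         path_2="wikidata_short_2.json"):
    """Concatenate the two forward-triple shards into a single dict."""
    with open(path_1) as f1, open(path_2) as f2:
        forward = json.load(f1)
        forward.update(json.load(f2))  # assumes the shards do not share subject keys
    return forward


def reverse_to_forward(rev_path="comp_wikidata_rev.json"):
    """Recover forward (subject, relation, object) triples from the
    reverse-indexed file, which maps object -> relation -> [subjects]."""
    with open(rev_path) as f:
        reverse = json.load(f)
    forward = {}
    for obj, relations in reverse.items():
        for rel, subjects in relations.items():
            for subj in subjects:
                forward.setdefault(subj, {}).setdefault(rel, []).append(obj)
    return forward
```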
Also, it would be much better if you could send a consolidated email after studying the code/dataset in detail. We don't like to disappoint people, but we have limited bandwidth to answer queries.
@vardaan123 , Thanks for the answers! I apologise if you felt that I was taking too much of your time.
From the point of view of reproducibility of results, it becomes very difficult if one does not have access to the exact environment that those results were produced in. Could you please provide the original dataset on which the results mentioned in the paper were obtained? Or an updated version of the paper with results on the updated dataset?
Regarding `wikidata_short_1.json` and `wikidata_short_2.json`: I used the same link to download, but it seems that Google Drive is messing up while it makes the zip file (which includes all the files in your directory). I'll download each file manually.
Thanks,
You could use a script like this to download Google Drive files:

```python
import requests


def download_file_from_google_drive(id, destination):
    def get_confirm_token(response):
        # Large files trigger a "can't scan for viruses" interstitial;
        # the confirmation token is stored in a download_warning cookie.
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value
        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768
        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print("Usage: python google_drive.py drive_file_id destination_file_path")
    else:
        # TAKE ID FROM SHAREABLE LINK
        file_id = sys.argv[1]
        # DESTINATION FILE ON YOUR DISK
        destination = sys.argv[2]
        download_file_from_google_drive(file_id, destination)
```
We will get back to you regarding the other questions soon.
@sanyam5 The paper is due to be updated on arXiv with the latest dataset figures. Please stay tuned.