Restructure data for performance
jmsv opened this issue · 5 comments
At the moment, the JSON dataset is structured as follows:
[
  {
    "a_lang": "eng",
    "a_word": "potato",
    "b_lang": "tnq",
    "b_word": "batata"
  },
  { ...
This is loaded as a Python dict and filtered using:
row = list(filter(
    lambda entry: entry['a_word'] == self.word
    and entry['a_lang'] == self.language.iso,
    etymwn_data))
If the data were restructured so that words acted as dict keys, looking up a word would be much faster: Python dicts are implemented as hash tables, so a key lookup is O(1) on average, whereas the filter above scans every row.
Data could instead be structured by language then by word, as follows:
{
  "lang": {
    "word": [
      {
        "origin-word": "origin-lang"
      }
    ]
  }
}
for example,
{
  "eng": {
    "airport": [
      {"air": "eng"},
      {"port": "eng"}
    ],
    "banana": [
      {"banaana": "wol"}
    ]
  },
  "lat": {
    "fructus": [
      {"fruor": "lat"}
    ]
  }
}
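With this structure, fetching a word's origins becomes a couple of hash lookups instead of a scan over every row, e.g. (assuming the file has been loaded as data):
origins = data["eng"]["banana"]  # -> [{"banaana": "wol"}], no filtering over the whole dataset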
Origin words are stored as individual single-entry dicts to prevent key collisions.
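For example (made-up entries), merging the origins into one dict couldn't represent two origins that happen to share a spelling:
{"sol": "spa", "sol": "lat"}        # one combined dict: the duplicate key means only one entry survives
[{"sol": "spa"}, {"sol": "lat"}]    # separate single-entry dicts: both origins are kept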
Open to suggestions for better ways of structuring it.
Potentially could expand out the origin dicts:
"eng":{
"airport":[
{"word": "air", "lang": "eng"},
{"word": "port", "lang": "eng"}
],
"banana":[
{"word": "banaana", "lang": "wol"}
]
}
I think it could make the loading of origins slightly clearer:
source_origins = data[self.language.iso][self.word]
origins = [
    ety.Word(origin["word"], origin["lang"]) for origin in source_origins
]
vs
source_origins = data[self.language.iso][self.word]
origins = [
    ety.Word(*info) for origin in source_origins for info in origin.items()
]
The downside to expanding out the dicts is that it'll result in a larger file.
@alxwrd I think it might be better to keep the smaller file and just add a comment in the code (or similar) to explain what's happening.
Rather than using *info, word and lang could be unpacked by hand, which is probably more readable:
origins = [
    ety.Word(word, lang) for origin in source_origins for word, lang in origin.items()
]
Yea that's nice actually 😋
For creating the new data file, is it going to be rebuilt from the original source, or by transforming the current file?
I think it'd be good to start from the source .tsv, and create a build_ety_data script that could live either in the repo root or in ety/wn/. The script would fetch the archived data, unpack it, and perform the transform. Then if there are any updates to the source, the data can easily be regenerated.
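A rough sketch of what the transform step could look like (the rel:etymology label, the column layout of the source .tsv, and the file names are assumptions here, not checked against the real data):
import csv
import json
from collections import defaultdict


def transform(tsv_path="etymwn.tsv", out_path="etymologies.json"):
    # Assumed row layout: "<lang>: <word>" \t <relation> \t "<lang>: <word>"
    data = defaultdict(lambda: defaultdict(list))
    with open(tsv_path, encoding="utf-8") as source:
        for row in csv.reader(source, delimiter="\t"):
            if len(row) != 3:
                continue
            a_term, relation, b_term = row
            if relation != "rel:etymology":  # assumed label for "a derives from b"
                continue
            if ":" not in a_term or ":" not in b_term:
                continue
            a_lang, a_word = (s.strip() for s in a_term.split(":", 1))
            b_lang, b_word = (s.strip() for s in b_term.split(":", 1))
            data[a_lang][a_word].append({b_word: b_lang})
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(data, out, ensure_ascii=False)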
Yeah, I was thinking we'd start from the original source too. I was in touch with the guy who maintains the dataset a couple of weeks ago, and apparently a new version will hopefully be released by August.
A script that stays in the repo is definitely a good idea - this would probably be best kept in ety/wn.
It's probably a good idea to only download the dataset if it's not available locally, but with the option to force a redownload; the original source is quite big, so downloading is time-consuming.
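Something like this for the fetch step, with the conditional download and a flag to force a redownload (the URL and file names here are placeholders, not the final values):
import os
import urllib.request
import zipfile

SOURCE_URL = "http://www1.icsi.berkeley.edu/~demelo/etymwn/etymwn.zip"  # placeholder
ARCHIVE_PATH = "etymwn.zip"


def fetch(redownload=False):
    # Skip the slow download if the archive is already available locally
    if redownload or not os.path.exists(ARCHIVE_PATH):
        urllib.request.urlretrieve(SOURCE_URL, ARCHIVE_PATH)
    # Unpack the .tsv next to the archive
    with zipfile.ZipFile(ARCHIVE_PATH) as archive:
        archive.extractall()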