Speed
Thanks a lot for the code and spending time to help. Would appreciate some more of your help:
$ python split.py DavaleSvarapriya MD
Reading knownpairs 2017-07-05 14:27:48.709972
Calculating costs of dictionary headwords 2017-07-05 14:27:48.863055
Calculated costs of dictionary headwords 2017-07-05 14:27:48.874079
Calculated maxword 2017-07-05 14:27:48.876109
valid permutations are 4
2017-07-05 14:27:48.933371
['Davala+ISvara+priya'] 5
2017-07-05 14:27:48.934428
$ python split.py astyuttarasyAmdiSidevatAtmA MD
Reading knownpairs 2017-07-05 14:31:07.751634
Calculating costs of dictionary headwords 2017-07-05 14:31:07.915072
Calculated costs of dictionary headwords 2017-07-05 14:31:07.928610
Calculated maxword 2017-07-05 14:31:07.931826
valid permutations are 1
2017-07-05 14:31:54.897486
astyuttarasyAmdiSidevatAtmA 4
2017-07-05 14:31:54.898249
The first invocation is impressively fast!
The second does not split at all. Any idea why?
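For reference, the wall-clock time of each run can be read off the first and last timestamps in the logs above. A minimal sketch in Python 3; the helper name `elapsed` is made up for illustration and is not part of split.py:

```python
from datetime import datetime

# Timestamp format used in the split.py log lines above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

def elapsed(start, end):
    """Return the seconds between two log timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# First and last timestamps of each invocation, copied from the logs.
first = elapsed("2017-07-05 14:27:48.709972", "2017-07-05 14:27:48.934428")
second = elapsed("2017-07-05 14:31:07.751634", "2017-07-05 14:31:54.898249")

print(round(first, 3))   # DavaleSvarapriya run
print(round(second, 3))  # astyuttarasyAmdiSidevatAtmA run
```

By these timestamps the first run finishes in well under a second, while the second spends roughly 47 seconds before returning the input unsplit.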
My implementation (unoptimized right now) is much slower in comparison, but it is able to split the latter as well.
python SanskritLexicalAnalyzer.py DavaleSvarapriya --split --input-encoding SLP1
Parsing of XMLs started at 2017-07-05 14:30:33.822133
Parsing of XMLs completed at 2017-07-05 14:30:37.894983
DavaleSvarapriya
DavaleSvarapriya
Start split: Wed, 05 Jul 2017 14:30:39
End split: Wed, 05 Jul 2017 14:32:39
[[u'Davala', [[u'ISvara', [[u'priya', None]]]]], [u'DavalA', [[u'ISvara', [[u'priya', None]]]]]]
My output is currently unflattened. As you can see, I have an extra split: Davala and DavalA both work lexically.
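To compare the two outputs directly, the nested result can be flattened into `+`-joined strings like the ones split.py prints. A minimal sketch in Python 3, assuming the structure seen above: a list of `[word, rest]` pairs where `rest` is either `None` or another such list. The function name `flatten_splits` is made up for illustration and is not part of SanskritLexicalAnalyzer.py:

```python
def flatten_splits(nodes):
    """Yield every root-to-leaf path of the nested split output as a list of words."""
    for word, rest in nodes:
        if rest is None:
            # Leaf: this word ends a complete split.
            yield [word]
        else:
            # Prepend this word to every split of the remainder.
            for tail in flatten_splits(rest):
                yield [word] + tail

# The nested output for DavaleSvarapriya, copied from above (u'' prefixes dropped).
nested = [['Davala', [['ISvara', [['priya', None]]]]],
          ['DavalA', [['ISvara', [['priya', None]]]]]]

flat = ['+'.join(path) for path in flatten_splits(nested)]
print(flat)  # ['Davala+ISvara+priya', 'DavalA+ISvara+priya']
```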
Now for the longer phrase:
python SanskritLexicalAnalyzer.py astyuttarasyAmdishidevatAtmA --split
Parsing of XMLs started at 2017-07-05 14:17:17.944997
Parsing of XMLs completed at 2017-07-05 14:17:21.984150
astyuttarasyAmdishidevatAtmA
astyuttarasyAmdiSidevatAtmA
Start split: Wed, 05 Jul 2017 14:17:23
End split: Wed, 05 Jul 2017 14:23:47
[[u'asti', [[u'ut', [[u'tara', [[u'syAm', [[u'diSi', [[u'de', [[u'vatA', [[u'at', [[u'mA', None]]]]], [u'vatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'vatAt', [[u'mA', None]]]]], [u'de', [[u'avat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avata', [[u'at', [[u'mA', None]]]]], [u'avata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avatA', [[u'at', [[u'mA', None]]]]], [u'avatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]], [u'devat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devata', [[u'at', [[u'mA', None]]]]], [u'devata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devatA', [[u'at', [[u'mA', None]]]]], [u'devatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]]]]]], [u'taras', [[u'yAm', [[u'diSi', [[u'de', [[u'vatA', [[u'at', [[u'mA', None]]]]], [u'vatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'vatAt', [[u'mA', None]]]]], [u'de', [[u'avat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avata', [[u'at', [[u'mA', None]]]]], [u'avata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avatA', [[u'at', [[u'mA', None]]]]], [u'avatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]], [u'devat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devata', [[u'at', [[u'mA', None]]]]], [u'devata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devatA', [[u'at', [[u'mA', None]]]]], [u'devatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]]]]]]]], [u'uttas', [[u'asyAm', [[u'diSi', [[u'de', [[u'vatA', [[u'at', [[u'mA', None]]]]], [u'vatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'vatAt', [[u'mA', None]]]]], [u'de', [[u'avat', 
[[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avata', [[u'at', [[u'mA', None]]]]], [u'avata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avatA', [[u'at', [[u'mA', None]]]]], [u'avatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]], [u'devat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devata', [[u'at', [[u'mA', None]]]]], [u'devata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devatA', [[u'at', [[u'mA', None]]]]], [u'devatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]]]]]], [u'uttara', [[u'syAm', [[u'diSi', [[u'de', [[u'vatA', [[u'at', [[u'mA', None]]]]], [u'vatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'vatAt', [[u'mA', None]]]]], [u'de', [[u'avat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avata', [[u'at', [[u'mA', None]]]]], [u'avata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avatA', [[u'at', [[u'mA', None]]]]], [u'avatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]], [u'devat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devata', [[u'at', [[u'mA', None]]]]], [u'devata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devatA', [[u'at', [[u'mA', None]]]]], [u'devatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]]]]]], [u'uttaras', [[u'yAm', [[u'diSi', [[u'de', [[u'vatA', [[u'at', [[u'mA', None]]]]], [u'vatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'vatAt', [[u'mA', None]]]]], [u'de', [[u'avat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avata', [[u'at', [[u'mA', None]]]]], [u'avata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avatA', [[u'at', [[u'mA', None]]]]], 
[u'avatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]], [u'devat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devata', [[u'at', [[u'mA', None]]]]], [u'devata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devatA', [[u'at', [[u'mA', None]]]]], [u'devatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]]]]]], [u'uttarasyAm', [[u'diSi', [[u'de', [[u'vatA', [[u'at', [[u'mA', None]]]]], [u'vatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'vatAt', [[u'mA', None]]]]], [u'de', [[u'avat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avata', [[u'at', [[u'mA', None]]]]], [u'avata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'avatA', [[u'at', [[u'mA', None]]]]], [u'avatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]], [u'devat', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devata', [[u'at', [[u'mA', None]]]]], [u'devata', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]], [u'devatA', [[u'at', [[u'mA', None]]]]], [u'devatA', [[u'A', [[u'at', [[u'mA', None]]]]], [u'AtmA', None], [u'Atma', None]]]]]]]]]]
Other than the Atma/AtmA ambiguity at the end (which is a problem with the sandhi method), all seem like valid splits, to be disambiguated morphologically. I will now need to optimize the code. Before that, I will ask a question on sandhi separately.
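One possible direction for the optimization: the output above repeats the same subtree under `diSi` for every way of splitting the prefix, so caching the splits of each remaining suffix would avoid recomputing them. A minimal sketch in Python 3 of that idea only. It assumes plain concatenation with no sandhi rules and a toy lexicon picked from the words in the output above, and `make_splitter` is a made-up name, not the analyzer's actual code:

```python
from functools import lru_cache

def make_splitter(lexicon):
    """Build a memoized splitter over a set of known words (toy version, no sandhi)."""
    @lru_cache(maxsize=None)
    def split(s):
        # The empty string has exactly one split: the empty one.
        if not s:
            return ((),)
        results = []
        for i in range(1, len(s) + 1):
            head = s[:i]
            if head in lexicon:
                # split(s[i:]) is computed once per distinct suffix, then reused.
                for tail in split(s[i:]):
                    results.append((head,) + tail)
        return tuple(results)
    return split

# Toy lexicon taken from words in the output above; not a real dictionary.
toy_lexicon = frozenset(['diSi', 'de', 'vatAt', 'mA', 'devat', 'AtmA'])
split = make_splitter(toy_lexicon)
splits = split('diSidevatAtmA')
print(['+'.join(p) for p in splits])
# ['diSi+de+vatAt+mA', 'diSi+devat+AtmA']
```

With sandhi in the picture the cache key would presumably need to cover whatever context the sandhi rules depend on, not just the remaining suffix.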
I think I'll go with the other implementation that I outlined in kmadathil/sanskrit_parser#4.
Now that's fast and can use actual forms.
Gerard's on GitHub, his own.