Syllables are cached incorrectly
StylishTriangles opened this issue · 1 comment
StylishTriangles commented
Hi,
I have run into a rare issue with syllable counting. Due to the way that syllables are currently cached, running readability's getmeasures on the same string twice can produce different results.
Consider the following string: "per se, see". On the first run, this is what happens:
- check for se in cache - no results
- se gets its last e removed, s remains
- there are 0 syllables in s, so cache['s'] = 0
- check for see in cache - no results
- see gets its last e removed, se remains
- there is 1 syllable in se, so cache['se'] = 1
On the second run, the lookup for se (from the phrase per se) hits the cache and finds 1, the value that was stored while processing see. This causes "per se, see" to have 3 syllables on the second run (compared to 2 on the first).
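The steps above can be reproduced with a small standalone sketch. This is not the library's actual code: the function names, the vowel-group regex, and the explicit `cache` argument are all assumptions made for illustration. The buggy variant stores the count under the stripped word, while the second variant stores it under the word as it was looked up, which is one possible way to make repeated runs agree.

```python
import re

# Hypothetical stand-in for the heuristic: one syllable per group of vowels.
VOWEL_GROUP = re.compile(r"[aeiouy]+")


def _strip_final_e(word):
    """Drop a trailing 'e', treating it as silent (illustrative heuristic)."""
    return word[:-1] if word.endswith("e") else word


def count_syllables_buggy(word, cache):
    """Looks up the original word but caches under the stripped word."""
    word = word.lower()
    if word in cache:
        return cache[word]
    stem = _strip_final_e(word)
    count = len(VOWEL_GROUP.findall(stem))
    cache[stem] = count  # bug: key differs from the key used for lookup
    return count


def count_syllables_consistent(word, cache):
    """Looks up and caches under the same key (the original word)."""
    word = word.lower()
    if word in cache:
        return cache[word]
    count = len(VOWEL_GROUP.findall(_strip_final_e(word)))
    cache[word] = count
    return count


def total_syllables(words, counter, cache):
    """Sum the syllable counts of all words using the given counter."""
    return sum(counter(w, cache) for w in words)


words = ["per", "se", "see"]

buggy_cache = {}
buggy_first = total_syllables(words, count_syllables_buggy, buggy_cache)
buggy_second = total_syllables(words, count_syllables_buggy, buggy_cache)

ok_cache = {}
ok_first = total_syllables(words, count_syllables_consistent, ok_cache)
ok_second = total_syllables(words, count_syllables_consistent, ok_cache)

print(buggy_first, buggy_second)  # 2 3
print(ok_first, ok_second)        # 2 2
```

In the buggy variant, processing see writes cache['se'] = 1, so the later lookup for se returns a count that was computed for a different word. Keying the cache on the word that was looked up avoids that collision, although the underlying count is still only a heuristic.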
I propose #6 to fix this issue.
Output from IPython:
In [3]: s = 'per se see'
In [4]: readability.getmeasures(s, 'en')['sentence info']['syllables']
Out[4]: 2
In [5]: readability.getmeasures(s, 'en')['sentence info']['syllables']
Out[5]: 3
andreasvc commented
Good catch. I applied your pull request. It's clear that the syllable counting is very much a heuristic approximation, but now at least it should give consistent results.