andreasvc/readability

Syllables are cached incorrectly

StylishTriangles opened this issue · 1 comments

Hi,
I have run into a rare issue with syllable counting. Due to the way that syllables are currently cached, running readability's getmeasures on the same string twice can produce different results.
Consider the string "per se, see". On the first run, this is what happens:

  1. check for se in cache - no results
  2. se gets its last e removed, s remains
  3. there are 0 syllables in s so cache['s'] = 0
  4. check for see in cache - no results
  5. see gets its last e removed, se remains
  6. there is 1 syllable in se, so cache['se'] = 1

On the second run, the lookup for se (from the phrase per se) hits the cache and finds 1, the value stored while processing see. This causes "per se, see" to have 3 syllables on the second run (compared to 2 on the first).

I propose #6 to fix this issue.
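
The mechanism can be reproduced outside the library with a minimal sketch. This is not readability's actual code: `make_counter`, `total`, the vowel-group regex, and the `buggy` flag are all hypothetical names and simplifications that I made up for illustration. The non-buggy branch shows one possible way to key the cache and is not necessarily what the pull request does.

```python
import re

def make_counter(buggy):
    """Return a syllable counter with its own cache (hypothetical, simplified)."""
    cache = {}

    def count(word):
        if word in cache:
            return cache[word]
        # Drop a trailing 'e' before counting vowel groups.
        stem = word[:-1] if word.endswith("e") else word
        n = len(re.findall(r"[aeiouy]+", stem))
        # Buggy variant stores the result under the stripped stem;
        # the other variant stores it under the word that was looked up.
        cache[stem if buggy else word] = n
        return n

    return count

def total(count, text):
    """Sum the syllable counts of all words in text."""
    return sum(count(w) for w in re.findall(r"[a-z]+", text.lower()))

buggy = make_counter(buggy=True)
print(total(buggy, "per se, see"), total(buggy, "per se, see"))  # 2 3

fixed = make_counter(buggy=False)
print(total(fixed, "per se, see"), total(fixed, "per se, see"))  # 2 2
```

With the cache keyed on the stripped stem, see writes cache['se'] and the next lookup of se reads it back. Keying on the original word keeps lookups and stores consistent.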

Output from IPython:

In [3]: s = 'per se see'                                                                            

In [4]: readability.getmeasures(s, 'en')['sentence info']['syllables']                              
Out[4]: 2

In [5]: readability.getmeasures(s, 'en')['sentence info']['syllables']                              
Out[5]: 3

Good catch. I applied your pull request. It's clear that the syllable counting is very much a heuristic approximation, but now at least it should give consistent results.