Word Frequency Hyphens misses some words
The online PPtext report lists the following pairs. They do not appear in the GG2 WF hyphens check. Some are false positives (e.g. "child like"), but I want to see them all for myself.
----- hyphenation and spaced pair check ---------------------------------------
'sea-captain' ❬-❭ 'sea captain'
3603: Clyde had been gentlemen, the one a clergyman, the other a sea-captain.
---
399: since her mother’s death, for her father was a sea captain, who never
'arm-chair' ❬-❭ 'arm chair'
1462: himself in the arm-chair, Guy waited until he came.
2323: sat shelling peas for dinner, and her grandfather in his arm-chair was
6898: arm-chair, the cook-stove, the tongs, Mrs. Noah and Flora, and timidly
---
3691: and sat upon the sofa, near his arm chair. Somehow it rested Guy to look
'finger-tips' ❬-❭ 'finger tips'
5834: feeling the blood tingle to his finger-tips as he thought of his
---
752: Maddy felt the hot blood tingling to her very finger tips, for the
'sick-room' ❬-❭ 'sick room'
1621: into the sick-room, startling even the grandmother, and causing her to
1800: to her sick-room seemed so much like a dream. From her grandfather she
7169: sick-room where Uncle Joseph lay, his thin face upturned to the light,
---
2052: much; but somehow it was very delightful there in that sick room, with
'house-cat' ❬-❭ 'house cat'
7522: no welcome save the purring of the house-cat, who came crawling at her
---
6782: of the clock and the purring of the house cat, which at sight of Maddy
'child-like' ❬-❭ 'child like'
5886: suspicion of the change, and her child-like trust in him was the anchor
---
433: others thought of a child like her becoming a school-mistress. The
Thanks for spotting that, Rick. It's definitely a bug: the check is not catching word pairs that are followed by punctuation, like
arm chair.
The fix is very simple. If you want to try it in the version you currently have, then in word_frequency.py at line 863 make the following change to the regular expression used to detect gaps between words:
whole_text = re.sub(r"(\n| +)", " ", maintext().get_text())
should be
whole_text = re.sub(r"[^-\p{Letter}]+", " ", maintext().get_text())
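
To see why the change helps: the old pattern only collapses newlines and runs of spaces, so "arm chair." keeps its trailing period and never looks like a whole-word pair, while the new pattern turns every run of characters that are neither letters nor hyphens into a single space. The sketch below is only an illustration, not the GG2 code. It uses the standard-library `re` module, which does not support `\p{Letter}` (that syntax needs the third-party `regex` module), so the new pattern is approximated with `[^-\w]` plus digits and underscore. `SAMPLE`, `normalize_old`, `normalize_new` and `has_spaced_pair` are hypothetical names, and the whole-word lookup is an assumption about how the pairs are matched.

```python
import re

# Stand-in for maintext().get_text(); both lines are copied from the report above.
SAMPLE = (
    "himself in the arm-chair, Guy waited until he came.\n"
    "and sat upon the sofa, near his arm chair. Somehow it rested Guy to look\n"
)


def normalize_old(text):
    """Old behaviour: only newlines and runs of spaces become one space."""
    return re.sub(r"(\n| +)", " ", text)


def normalize_new(text):
    """Approximation of the fix: any run of characters that is neither a
    hyphen nor a letter becomes one space."""
    # Stdlib re has no \p{Letter}; "not a hyphen and not a letter" is
    # approximated as "not a hyphen or word character, or a digit/underscore".
    return re.sub(r"(?:[^-\w]|[\d_])+", " ", text)


def has_spaced_pair(normalized, hyphenated):
    """Hypothetical check: does the spaced form of `hyphenated` occur as
    whole, space-delimited words in the normalized text?"""
    spaced = hyphenated.replace("-", " ")
    return f" {spaced} " in f" {normalized} "


if __name__ == "__main__":
    print(has_spaced_pair(normalize_old(SAMPLE), "arm-chair"))
    print(has_spaced_pair(normalize_new(SAMPLE), "arm-chair"))
```

With the old normalization the period stays attached to "chair", so the spaced pair is missed; with the new one the punctuation is gone and the pair is found, while the hyphenated "arm-chair" is left intact.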