Abbreviations and glyphs not recognised. Lags behind Perl version
kimryan opened this issue · 4 comments
Hi, I am maintaining the original Perl version of this module here: https://github.com/kimryan/Lingua-EN-Sentence. I found several issues with this version:
- The list of abbreviations has fallen behind the one in the Perl module
- Sentences break after an abbreviation, as in '... Esq. more text'
- An ellipsis such as '...' or '. .' causes sentence breaks
- A sentence breaks after a digit followed by a dot, such as '1.', in mid-sentence
I upgraded my module to fix point 2; it already handled all the other cases correctly.
use Lingua::EN::Sentence;
my Str $text = Q[A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot
Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq. and others
and some text ellipsis such as ... or . . are ignored.
Some valid cases cannot be detected, such as the answer is X. It cannot easily be
differentiated from the single letter-dot sequence to abbreviate a person's given name.
Numbered points within a sentence will not cause a split 1. Like this one.
See the code for all the rules that apply.
This string has 7 sentences.];
my @Sentences = $text.sentences;
my $i;
for @Sentences -> $sub-element {
    $i++;
    say "SENTENCE $i:$sub-element";
}
SENTENCE 1:A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
SENTENCE 2:A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot
SENTENCE 3:Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq.
SENTENCE 4:and others
and some text ellipsis such as .
SENTENCE 5:or .
SENTENCE 6:are ignored.
SENTENCE 7:Some valid cases cannot be detected, such as the answer is X. It cannot easily be
differentiated from the single letter-dot sequence to abbreviate a person's given name.
SENTENCE 8:Numbered points within a sentence will not cause a split 1.
SENTENCE 9:Like this one.
SENTENCE 10:See the code for all the rules that apply.
SENTENCE 11:This string has 7 sentences..
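For illustration only, the four cases listed above can be expressed as a few word-level checks. This is a minimal sketch and not the code of either module: the names `ends-sentence` and `naive-sentences` and the five-entry abbreviation list are hypothetical, and the real modules use a much longer list and more rules.

```raku
# Hypothetical sketch: decide word by word whether a sentence ends here.
# The abbreviation list is a tiny illustrative sample, not the module's list.
my $abbrev = set <Dr. Ms. Apr. Calif. Esq.>;

sub ends-sentence(Str $word --> Bool) {
    # Only words ending in '.', '!' or '?' can close a sentence.
    return False unless $word.ends-with('.') || $word.ends-with('!') || $word.ends-with('?');
    return False if $word (elem) $abbrev;            # known abbreviation, e.g. 'Esq.'
    return False if $word ~~ /^ '.'+ $/;             # ellipsis written as '...' or '. .'
    return False if $word ~~ /^ \d+ '.' $/;          # numbered point such as '1.'
    return False if $word ~~ /^ [ <:Lu> '.' ]+ $/;   # initials such as 'I.' or 'U.S.A.'
    True;
}

sub naive-sentences(Str $text) {
    my @sentences;
    my @current;
    for $text.words -> $word {
        @current.push: $word;
        if ends-sentence($word) {
            # Close the current sentence and start a new one.
            @sentences.push: @current.join(' ');
            @current = ();
        }
    }
    # Keep any trailing text that has no closing punctuation.
    @sentences.push: @current.join(' ') if @current;
    @sentences;
}

.say for naive-sentences('Ask Dr. I. Smith, Esq. about it ... or see point 1. in the list. Then stop.');
```

With these rules the sample string should split only after `list.` and `stop.`. The sketch has the same blind spot mentioned in the test text: a sentence that really does end in a single capital letter or a number, such as "the answer is X.", is not detected.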
Hi @kimryan ,
I haven't used Rakudo for just about as long as it has been since the last commit to this repository, so I have no plans to maintain the package any further in the Perl 6 ecosystem.
Let me know if you - or someone else - are interested in taking over maintenance.
Hi @dginev ,
I haven't done any Raku coding but should be able to pick it up. Happy to take over maintenance of this package.
@kimryan feel free to file a PR updating the repository to the Perl 5 state. After that I'll try to transfer ownership.
(Though I'd have to double-check how that is done in the Raku world nowadays.)
Fixed in latest release.