kimryan/perl6-Lingua-EN-Sentence

Abbreviations and glyphs not recognised. Lags behind Perl version

kimryan opened this issue · 4 comments

Hi, I maintain the original Perl version of this module here: https://github.com/kimryan/Lingua-EN-Sentence. I found several issues with this version:

  1. The list of abbreviations has fallen behind the one in the Perl module.
  2. Sentences break after an abbreviation, as in '... Esq. more text'.
  3. An ellipsis written as '...' or '. .' causes sentence breaks.
  4. A digit followed by a dot, such as '1.', causes a break in mid-sentence.

I upgraded my module to fix point 2; it handles all the other cases correctly.

use Lingua::EN::Sentence;
my Str $text = Q[A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot

Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq. and others
and some text ellipsis such as ... or . . are ignored.
Some valid cases cannot be detected, such as the answer is X. It cannot easily be
differentiated from the single letter-dot sequence to abbreviate a person's given name.
Numbered points within a sentence will not cause a split 1. Like this one.
See the code for all the rules that apply.
This string has 7 sentences.];

my @Sentences = $text.sentences;
my $i;
for @Sentences -> $sub-element {
    $i++;
    say "SENTENCE $i:$sub-element";
}

SENTENCE 1:A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
SENTENCE 2:A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot
SENTENCE 3:Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq.
SENTENCE 4:and others
and some text ellipsis such as .
SENTENCE 5:or .
SENTENCE 6:are ignored.
SENTENCE 7:Some valid cases cannot be detected, such as the answer is X. It cannot easily be
differentiated from the single letter-dot sequence to abbreviate a person's given name.
SENTENCE 8:Numbered points within a sentence will not cause a split 1.
SENTENCE 9:Like this one.
SENTENCE 10:See the code for all the rules that apply.
SENTENCE 11:This string has 7 sentences..
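
For illustration, the four cases listed above can all be handled by one common technique: temporarily masking the dots that must not end a sentence, splitting, and then restoring them. The following is a minimal sketch of that idea only. It is not the code of either module, and the names `MASK`, `@abbreviations` and `split-sentences`, the short abbreviation list, and the one-or-two-digit rule for numbered points are all assumptions made for the example.

```raku
# Hypothetical sketch; none of these names come from Lingua::EN::Sentence.
my constant MASK = "\x[1]";                 # placeholder for "protected" dots
my @abbreviations = <Dr Ms Apr Calif Esq>;  # tiny illustrative list (assumption)

sub split-sentences(Str $text) {
    my $t = $text;

    # Points 1 and 2: protect the dot after a known abbreviation.
    for @abbreviations -> $abbr {
        $t .= subst(/ << $abbr '.' /, $abbr ~ MASK, :g);
    }

    # Point 3: protect ellipses written as '...' or '. .'.
    $t .= subst(/ '.' [ \s? '.' ]+ /, -> $m { $m.Str.subst('.', MASK, :g) }, :g);

    # Point 4: protect a short number followed by a dot, such as ' 1. '
    # (heuristic: one or two digits surrounded by whitespace).
    $t .= subst(/ <?after \s> (\d ** 1..2) '.' <?before \s> /,
                -> $m { $m[0] ~ MASK }, :g);

    # Single capital letters such as 'I.' or the parts of 'U.S.A.'.
    $t .= subst(/ << (<:Lu>) '.' /, -> $m { $m[0] ~ MASK }, :g);

    # Split after '.', '!' or '?' followed by whitespace, or at a blank line,
    # then restore the protected dots.
    $t.split(/ <?after <[.!?]>> \s+ || \n \s* \n /)
      .map(*.subst(MASK, '.', :g).trim)
      .grep(*.chars)
      .List;
}

.say for split-sentences(
    "Ms. A.B. Jones, Esq. wrote this ... eventually. It has point 1. inside! Done?"
);
# Ms. A.B. Jones, Esq. wrote this ... eventually.
# It has point 1. inside!
# Done?
```

As with the example text above, a sentence that really does end in a single capital letter ("the answer is X.") is not split by this sketch, because it looks the same as an initial.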

Hi @kimryan,

I haven't used Rakudo since about the time of the last commit to this repository, so I have no plans to maintain the package further in the Perl 6 ecosystem.

Let me know if you, or someone else, would be interested in taking over maintenance.

Hi @dginev,
I haven't done any Raku coding but should be able to pick it up. Happy to take over maintenance of this package.

@kimryan feel free to file a PR that brings the repository up to the state of the Perl 5 module. After that I'll try to transfer ownership.

(Though I'd have to double-check how that is done in the Raku world nowadays)

Fixed in the latest release.