curiosity-ai/catalyst

Whitespace added to captured tokens for PatternSpotter

cgreathouse opened this issue · 0 comments

Describe the bug
When a PatternSpotter pattern matches tokens that contain symbols (such as an apostrophe or a quotation mark), the value of the captured token contains additional whitespace.

To Reproduce
Example for finding possessives

```csharp
var possessive = new PatternSpotter(Language.English, 0, "possessive", "Possessive");
possessive.NewPattern("Possessive", p => p.Add(
    new PatternUnit(PatternUnitPrototype.Single().WithPOS(PartOfSpeech.PROPN, PartOfSpeech.NOUN)),
    new PatternUnit(PatternUnitPrototype.Single().WithToken("'s"))
));

// `pipeline` is assumed to be an already-constructed Catalyst pipeline; its setup is not shown in this report
pipeline.Add(possessive);
var doc = new Document("The dog's bone", Language.English);
pipeline.ProcessSingle(doc);

var tokens = doc.SelectMany(span => span.GetCapturedTokens()).Select(e => new
{
    e.Begin,
    e.End,
    e.Value
});
```

The captured value will contain a space between ' and s (i.e. dog' s).

Something similar happens when capturing words wrapped in double quotes.

Example pattern

```csharp
var doubleQuoted = new PatternSpotter(Language.English, 0, "double-quoted", "DoubleQuoted");
doubleQuoted.NewPattern("DoubleQuoted", p => p.Add(
    new PatternUnit(PatternUnitPrototype.Single().WithToken("\"")),
    new PatternUnit(PatternUnitPrototype.ShouldNotMatch().WithToken("\"")),
    new PatternUnit(PatternUnitPrototype.Single().WithToken("\""))
));
```

Test string : A sentence that "has double quotes" in it

The captured token will contain two additional spaces, one after the opening quote and one before the closing quote (i.e. " has double quotes ").

Expected behavior
No additional whitespace in the captured value (i.e. dog's and "has double quotes").