Whitespace added to captured tokens for PatternSpotter
cgreathouse opened this issue · 0 comments
Describe the bug
When a PatternSpotter pattern matches symbol tokens (such as an apostrophe or a quotation mark), the value of the captured token contains additional whitespace that is not in the source text.
To Reproduce
Example pattern for finding possessives:
`var possessive = new PatternSpotter(Language.English, 0, "possessive", "Possessive");
possessive.NewPattern("Possessive", p => p.Add(
    new PatternUnit(PatternUnitPrototype.Single().WithPOS(PartOfSpeech.PROPN, PartOfSpeech.NOUN)),
    new PatternUnit(PatternUnitPrototype.Single().WithToken("'s"))
));
// pipeline is assumed to be an existing Pipeline for Language.English
pipeline.Add(possessive);
var doc = new Document("The dog's bone", Language.English);
pipeline.ProcessSingle(doc);
var tokens = doc.SelectMany(span => span.GetCapturedTokens()).Select(e => new
{
    e.Begin,
    e.End,
    e.Value
});`
The captured value has a space between ' and s (i.e. dog' s).
Something similar happens when capturing words wrapped in double quotes.
Example pattern
`var doubleQuoted = new PatternSpotter(Language.English, 0, "double-quoted", "DoubleQuoted");
doubleQuoted.NewPattern("DoubleQuoted", p => p.Add(
    new PatternUnit(PatternUnitPrototype.Single().WithToken("\"")),
    new PatternUnit(PatternUnitPrototype.ShouldNotMatch().WithToken("\"")),
    new PatternUnit(PatternUnitPrototype.Single().WithToken("\""))
));`
Test string: A sentence that "has double quotes" in it
The captured token has two additional spaces, one just inside each quote (i.e. " has double quotes ").
Expected behavior
No additional whitespace in the captured value (i.e. dog's and "has double quotes").
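Until the spacing is fixed, one possible workaround is to rebuild the value from the original text instead of reading e.Value. The following is only a minimal sketch: it assumes Begin and End are character indices into the document text and that End is inclusive, neither of which is confirmed in this report. SliceByOffsets is a hypothetical helper, not part of the library, and the offsets used in Main are computed by hand for illustration.

```csharp
using System;

public static class CapturedValueCheck
{
    // Hypothetical helper: returns the exact source text between two offsets.
    // Assumes `begin` and `endInclusive` are character indices into `text`
    // and that the end offset is inclusive (an assumption, not confirmed).
    public static string SliceByOffsets(string text, int begin, int endInclusive)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (begin < 0 || endInclusive < begin || endInclusive >= text.Length)
            throw new ArgumentOutOfRangeException(nameof(begin));
        return text.Substring(begin, endInclusive - begin + 1);
    }

    public static void Main()
    {
        // Possessive example: "dog's" occupies indices 4 through 8.
        const string possessiveText = "The dog's bone";
        Console.WriteLine(SliceByOffsets(possessiveText, 4, 8)); // prints dog's

        // Quoted example: locate the quotes rather than hard-coding offsets.
        const string quotedText = "A sentence that \"has double quotes\" in it";
        int begin = quotedText.IndexOf('"');
        int end = quotedText.LastIndexOf('"');
        Console.WriteLine(SliceByOffsets(quotedText, begin, end)); // prints "has double quotes"
    }
}
```

Slicing the source text preserves whatever spacing the input actually had, so the result matches the expected values above regardless of how the tokens were split.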