TokensRegex cannot detect rules cross the period '.'
Closed this issue ยท 10 comments
The task is to detect apartment number via tokensRegex.
Example sentence: I live in 123 Pretty RD, APT. #456.
Here is the rule used to detect the apartment number: { ruleType: "tokens", pattern: ( /APT/ /./ /#/ [{word:/[0-9]+/}]), action: Annotate($0, ner, "APT#"), result:"APARTMENT NUMBER"}
Above rule failed to detect the pattern APT. #456. It looks like TokensRegex cannot correctly recognize the rule across the period '.'
A guess is a change in line 713 would do the trick โฆ
https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/process/PTBLexer.flex
That's an excellent guess, and in fact that's the first place I went to when trying to diagnose this. The problem is clearly that the tokenizer is splitting this into two sentences rather than keeping it as one. However, despite having Apt
with that exact capitalization in the abbreviation list there, the tokenizer is actually case-insensitive to those abbreviations. Here are a couple examples:
I lived at APT. 303 for many years ... one sentence
I lived at APT. #303 for many years .... two sentences
I lived at Apt. 303 for many years ... one sentence
I lived at Apt. #303 for many years ... two sentences
My belief is that we need to add #
as a token that can follow ABBREV3
, at which point it will work as desired. In the meantime, if you happen to be using data of sentence each, you can always use the options that force the tokenizer to produce exactly one sentence per query.
It's a small change, but I hesitate to make such a change without running it by my PI @manning (who is unfortunately out of town for the next couple weeks). If you want to give the abbrev3_hash
branch a try, it might work better for your purposes.
Thank you for the fix and quick response!
Hi @AngledLuffa, I wonder if @manning had a chance to see if we can merge the current fix? Thank you.
It's already merged. I can make a new release that includes it soon if you need, or you can just use the dev branch
It's already merged. I can make a new release that includes it soon if you need, or you can just use the dev branch
That would be great if we can have a release version of it. Thanks very much for the quick response!
It's not an official release yet, but I built a version here:
https://nlp.stanford.edu/software/stanford-corenlp-4.5.5b.zip
I'd like to make some more changes before making an official release
--Thanks very much for the release!
--Looking forward to the official release because our security policies only allow official releases of dependencies to move to production.
--Shall I know when the official release will occur?
Thanks!
Now released in 4.5.6 (may take a little time to show up on Maven)
Thanks so much! I appreciate it @AngledLuffa