clulab/processors

New date patterns

Hi @EgoLaparra,
Can you please add two more date/range patterns for:
"month of X", e.g., "month of January", and
"before mid-July"?

Thank you!
Mihai

I added the "month of X" pattern in this PR:
#585

So, the only thing left to do is "before mid-July".

@MihaiSurdeanu how do you want to normalize such expressions? Something like "before mid-July" -> "XXXX-XX-XX -- XXXX-07-15"? Do you want us to also cover "the beginning of July" and "the end of July"?

@EgoLaparra: yes on all questions :)

I think "mid"|"middle", "beginning"|"start", "end" probably captures most situations. Not sure yet how to handle "mid-July" when it arrives as a single token. Maybe we should split it into "mid" and "July"?

I think we should split it so we can handle all those cases in a similar way.
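For reference, a minimal Python sketch of the normalization discussed above. Only the "mid" -> day 15 mapping comes from the example in this thread; mapping "beginning"/"start" to day 1 and "end" to the last day of the month is an assumption, and `anchor_day` and `normalize_before` are hypothetical names, not part of processors.

```python
import calendar

# English month names, hardcoded to avoid locale-dependent lookups.
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def anchor_day(modifier: str, month: int) -> int:
    """Map a modifier word to a day of the month (hypothetical helper)."""
    if modifier in ("mid", "middle"):
        return 15  # taken from the "before mid-July" example in this thread
    if modifier in ("beginning", "start"):
        return 1  # assumption: the beginning is the first day
    if modifier == "end":
        # Assumption: the year is unknown, so use a non-leap year (2001)
        # to find the last day; February therefore maps to 28.
        return calendar.monthrange(2001, month)[1]
    raise ValueError(f"unknown modifier: {modifier}")


def normalize_before(modifier: str, month_name: str) -> str:
    """Normalize 'before <modifier> <month>' to an open-start range."""
    month = MONTHS.index(month_name.lower()) + 1
    day = anchor_day(modifier.lower(), month)
    return f"XXXX-XX-XX -- XXXX-{month:02d}-{day:02d}"


print(normalize_before("mid", "July"))
```

Under these assumptions, "before the end of July" would come out as "XXXX-XX-XX -- XXXX-07-31".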

@EgoLaparra: This tokenization issue has now been fixed in this PR:
#586
which has been merged in master.

"mid-July" is tokenized into "mid" and "July" (the dash disappears).
Thanks!

@kwalcock : I wonder what other prefixes similar to "mid-" should be tokenized out?

The article at https://www.proofreadnow.com/blog/master-prefixes-and-suffixes-with-hyphens suggests trans- and non- along with mid-. Perhaps they would help with parsing.

Hyphens are used after all prefixes preceding a proper noun, a number, or an abbreviation (e.g., "trans-Atlantic network," "mid-1960s," or "non-GABAergic responses").

There's also this list:

Use a hyphen after the following prefixes in most words: "all-", "cross-", "ex-", and "self-" (e.g., “self-service,” “ex-boyfriend,” “all-encompassing”). Most "servo-" words are also hyphenated with the following two exceptions: “servomechanism” and “servomotor.”

For our numeric parsing do we account for -fold in any way?

"bi" and "semi" are often not written with hyphens but the construction is fairly productive. If these prefixes are added to dates, are they still recognized?

From these, I think the following should be tokenized: bi, semi, non, all. The reasons are that (a) they commonly appear as separate tokens as well, and (b) they may impact the downstream components in processors.

These are addressed in this PR:
#587

I'll merge it as soon as tests pass.
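To illustrate the prefix splitting discussed in this thread, here is a rough Python sketch. It is not the tokenizer code from the PRs; `split_prefix` and `tokenize` are hypothetical names, and the sketch assumes whitespace-separated input and only handles the hyphenated forms.

```python
import re

# Prefixes mentioned in this thread: "mid", plus bi, semi, non, all.
SPLIT_PREFIXES = ("mid", "bi", "semi", "non", "all")

# A leading alphabetic prefix, a hyphen, then the rest of the token.
PREFIXED = re.compile(r"^([A-Za-z]+)-(.+)$")


def split_prefix(token, prefixes=SPLIT_PREFIXES):
    """Split 'mid-July' into ['mid', 'July']; the dash disappears."""
    m = PREFIXED.match(token)
    if m and m.group(1).lower() in prefixes:
        return [m.group(1), m.group(2)]
    return [token]  # unknown prefix or no hyphen: leave the token alone


def tokenize(text, prefixes=SPLIT_PREFIXES):
    """Whitespace tokenization followed by prefix splitting (simplified)."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_prefix(raw, prefixes))
    return tokens


print(tokenize("before mid-July"))
```

Tokens with other prefixes, such as "self-service" or "trans-Atlantic", are left intact in this sketch because they are not in the prefix list.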

Great, thanks!

Sorry, I accidentally closed the issue 😅

@mihai, for what cases is this rule required? Don't date-yyyy-mm-dd, date-dd-mm-yyyy, ... already cover dates with numeric months? I've tried commenting it out and all tests pass.

# month values, 1 to 12
- name: month-values
  label: PossibleMonth
  priority: ${ rulepriority }
  type: token
  pattern: |
    [word=/^([1-9]|1[012])$/]

The issue is that I need to produce dates with only month names to solve cases like mid-July, and that rule is not very convenient since it interprets single numbers (from 1 to 12) as months. Could we safely remove it?
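As an illustration of the over-matching, the regular expression from the rule can be tried in isolation. This is a Python sketch using the `re` module, not Odin itself; `possible_months` is a hypothetical helper and the sentence is a made-up example.

```python
import re

# Same pattern as the word constraint in the month-values rule.
MONTH_VALUE = re.compile(r"^([1-9]|1[012])$")


def possible_months(tokens):
    """Return every token the pattern would label as PossibleMonth."""
    return [t for t in tokens if MONTH_VALUE.match(t)]


# The bare "3" is a count here, not a month, but the pattern still matches it.
print(possible_months("We sampled 3 plots in July".split()))
```

The pattern has no access to context, so any standalone number from 1 to 12 matches, while "July" itself does not.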

Maybe... Can you please try to remove it and see if all unit tests pass?

All the tests pass, it seems that we can remove it. I'll make the change and add a rule to identify single-month dates.

Sounds good. Thank you @EgoLaparra !

@EgoLaparra: has this been merged into master? Can we close the issue?

Not entirely yet. Support for modifiers (e.g., mid-July) will be merged with #596.

Close this now?

Yes!