DavidMStraub/python-gedcom7

wrong parsing of DatePeriod

geostag opened this issue · 6 comments

The DatePeriod parser unintentionally parses the keywords of DatePeriods as month in some cases.

import gedcom7.types as T
T.DatePeriod('FROM 21 JAN 1990 TO 22 FEB 2000').parse()
T.DatePeriod('FROM  1990 TO  2000').parse()

The first DatePeriod ist recognized correctly as period from 1999-01-21 to 2000-02-22. The second one however, which is a valid DatePeriod according to the GEDCOM 7 specification, is parsed as

{
   "from":{
      "calendar":"None",
      "day":1990,
      "month":"TO",
      "year":2000,
      "epoch":"None"
   }
}

The root cause is, that the orginal specification of GEDCOM 7 does not exclude the keywords from the month pattern, thus leading to this ambiguous interpretation, which is formal valid but not intended.

With the following change in grammar.py, you get the intended interpretation:

- month = f'({stdtag}|{exttag})'
+ no_months = f'FROM|TO|BET|AND|AFT|BEF'
+ month = f'(?!{no_months})({stdtag}|{exttag})'

Perhaps this should be fixed in GEDCOM 7 as well.

Interesting, thanks! Yes, I think this should be raised at the Gedcom repository as well, can you open an issue there?

I submitted the question to gedcom-L, the german mailing list for GEDCOM specification.

No reaction so far. Therefore I checked out the issues in https://github.com/FamilySearch/GEDCOM/issues/ . It seems, that their approach is not to reflect everything in the ABNF rules. Some of the rules, which have to be followed to parse GEDCOM correctly, are written in the text part only. E.g.

In addition to the constraints above:
...
No calendar names, months, or epochs match dateRestrict .

in the GEDCOM definition seems to be regarded as sufficient.

Thus I suggest to just add the dateRestrict definition in the grammar and use this in the negative lookahead for the month rules as well in the rules for calendar and epoch.

PR modified to use daterestrict to exclude the keywords from month, epoch and calendar. BCE gets a special handling, because it is a valid epoch.

epoch handling: FamilySearch/GEDCOM#273

Fixed by #3 🚀