MWE numbering within sentence is inconsistent
In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.
This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.
In the script for #41:
Lines 53 to 62 in 09014b4
Lines 124 to 129 in 09014b4
For a normal form, it probably makes the most sense to number MWEs in ascending order by start token, using strength only as a tiebreaker (strong before weak; note that a weak MWE's tokens will be a superset of the tokens of any strong MWE it contains). That way, if the strength of a single MWE is modified, it won't require renumbering. And if the strength distinction is removed, it will mean collapsing some strong+weak combinations, but not reordering MWEs.
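A rough sketch of that ordering, in case it helps make the proposal concrete. This is not the code from the script referenced above: the `MWE` type, the `"strong"`/`"weak"` labels, and `normalize_numbering` are all hypothetical names, and it assumes each MWE is given as a strength plus its token offsets within the sentence.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class MWE(NamedTuple):
    # Hypothetical representation, not the .conllulex column format.
    strength: str            # "strong" or "weak"
    tokens: Tuple[int, ...]  # token offsets within the sentence


def normalize_numbering(mwes: Iterable[MWE]) -> Dict[int, MWE]:
    """Number MWEs from 1 by start token; strong before weak on ties."""
    ordered = sorted(
        mwes,
        key=lambda m: (
            min(m.tokens),           # primary key: first token of the MWE
            m.strength != "strong",  # tiebreaker: False (strong) sorts first
            m.tokens,                # last resort, keeps the order deterministic
        ),
    )
    return {number: mwe for number, mwe in enumerate(ordered, start=1)}


# Made-up sentence: a weak MWE that contains a strong one, plus a later strong MWE.
example = [
    MWE("weak", (1, 2, 3, 5)),
    MWE("strong", (4, 6)),
    MWE("strong", (1, 2)),
]
numbered = normalize_numbering(example)
```

Here the strong MWE on tokens 1-2 gets number 1, the weak MWE starting at the same token gets 2, and the strong MWE starting at token 4 gets 3, regardless of the order in which they were listed.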
Numbering is renormalized in streusle.conllulex (not yet propagated to splits)