nert-nlp/streusle

MWE numbering within sentence is inconsistent


In some sentences all strong MWEs are numbered before weak ones; in others the numbering is by token offset.

This does not matter for the semantics, but it means that equivalent files will be superficially different. So perhaps we should enforce a normal form for numbering MWEs.

In the script for #41:

# Note that numbering of strong+weak MWEs doesn't follow a consistent order in the data!
# Ordering by first token offset (tiebreaker to strong MWE):
#xgroups = [(min(sg),'s',sg) for sg in sgroups] + [(min(wg),'w',wg) for wg in wgroups]
# Putting all strong expressions before any weak expressions:
xgroups = [(None,'s',sg) for sg in sgroups] + [(None,'w',wg) for wg in wgroups]
# This means that the MWE columns are not *completely* determined by
# the lextag in a way that matches the original data, but different MWE
# orders does not matter semantically.
# See also check in _postproc_sent(), which ensures that the MWE numbers
# count from 1, but does not mandate an order.

streusle/UDlextag2json.py

Lines 124 to 129 in 09014b4

# check that MWEs are numbered from 1
# fix_mwe_numbering.py was written to correct this
# However, this does NOT require a particular sort order of the MWEs in the sentence.
# It just requires that they have unique numbers 1, ..., N if there are N MWEs.
for i,(k,mwe) in enumerate(sorted(chain(sent['smwes'].items(), sent['wmwes'].items()), key=lambda x: int(x[0])), 1):
    assert int(k)==i,(sent['sent_id'],i,k,mwe)

For a normal form, it probably makes the most sense to number MWEs in ascending order by start token, using strength only as a tiebreaker (strong before weak; note that when a weak MWE contains a strong one, the weak MWE's tokens are a superset of the strong MWE's tokens). That way, if the strength of a single MWE is changed in isolation, no renumbering is required. And if the strength distinction is removed altogether, some strong+weak combinations will have to be collapsed, but the MWEs will not need to be reordered.
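A minimal sketch of the proposed normal form follows. It is not the fix that was eventually merged. It assumes that `smwes` and `wmwes` are dicts keyed by integer MWE number and that each MWE entry has a `toknums` list of token offsets. The function name `renumber_mwes` is hypothetical.

```python
def renumber_mwes(smwes, wmwes):
    """Renumber MWEs by start token, strong before weak on ties.

    Returns (new_smwes, new_wmwes, old_to_new). Assumes each MWE dict
    has a 'toknums' list (an assumption about the data layout).
    """
    # rank 0 = strong, rank 1 = weak, so strong sorts first when starts tie
    entries = [(min(m['toknums']), 0, k, m) for k, m in smwes.items()]
    entries += [(min(m['toknums']), 1, k, m) for k, m in wmwes.items()]
    entries.sort(key=lambda e: (e[0], e[1]))

    new_smwes, new_wmwes, old_to_new = {}, {}, {}
    for new_num, (_start, rank, old_key, mwe) in enumerate(entries, 1):
        target = new_smwes if rank == 0 else new_wmwes
        target[new_num] = mwe
        old_to_new[int(old_key)] = new_num
    return new_smwes, new_wmwes, old_to_new


# Hypothetical sentence: strong MWE 3 is contained in weak MWE 2.
smwes = {1: {'toknums': [4, 5]}, 3: {'toknums': [1, 2]}}
wmwes = {2: {'toknums': [1, 2, 3]}}
new_smwes, new_wmwes, old_to_new = renumber_mwes(smwes, wmwes)
print(old_to_new)
```

The returned `old_to_new` mapping would still have to be applied to any per-token references to MWE numbers, which this sketch does not touch.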

Numbering is renormalized in streusle.conllulex (not yet propagated to splits)

Fully fixed in #47