LeonieWeissweiler/UCxn

Proposed notation


We propose a new layer for selectively annotating constructions on top of UD trees. This is intended for constructions (in the sense of Construction Grammar) whose form and meaning/function are not already captured well by the UD tree. Construction instances receive a type name (possibly from a constructicon resource) and may contain relations to construction elements. The elements of the construction are not constrained by the UD tree: e.g., a construction element may cut across multiple UD subtrees. For now, we envision that these annotations would be marked in the MISC column of .conllu files, though in principle they could be moved to a separate extension column.

The annotation layer does not have the goal of directly indicating the elements of form or meaning that are characteristic of or required by the construction, beyond indicating the construction evoker and spans of construction elements. Aspects of the UD analysis (tags, deprels, morphological features) that are characteristic of a construction's form should be described as such in a type-level constructicon entry. The precise contents of such an entry are not part of this proposal, but constructicons incorporating UD information in some way already exist (e.g., the Russian Constructicon).

Full

Showing three overlapping constructions for completeness:

1	Sam	CxnEltOf=5:predicative-age.Individual,5:property-predication.Subj
2	is	CxnEltOf=5:property-predication.Cop
3	three	CxnEltOf=4:num-mod.Quantity,5:predicative-age.Value
4	years	Cxn=num-mod|CxnEltOf=4:num-mod.Counted,5:predicative-age.Units
5	old	Cxn=predicative-age,property-predication|CxnEltOf=5:property-predication.Pred

This effectively encodes construction-element relationships as dependencies (the offset:relation notation echoes the DEPS column), which would allow for straightforward graph querying. A common query might be to list the UD deprels associated with a construction element.
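As a rough illustration of the graph-style reading described above, here is a minimal Python sketch that turns the Full example into (evoker, construction, element, token) edges. The function names (`parse_misc`, `cxn_edges`, `elements_of`) are hypothetical, the rows are reduced to ID, FORM and MISC rather than the ten CoNLL-U columns, and the sketch assumes that construction names contain no `:` and element names contain no `.`. Joining the edges with the DEPREL column for the deprel query is left out, since the example rows do not show it.

```python
from collections import defaultdict

# (ID, FORM, MISC) rows copied from the Full example; other CoNLL-U columns omitted.
ROWS = [
    ("1", "Sam", "CxnEltOf=5:predicative-age.Individual,5:property-predication.Subj"),
    ("2", "is", "CxnEltOf=5:property-predication.Cop"),
    ("3", "three", "CxnEltOf=4:num-mod.Quantity,5:predicative-age.Value"),
    ("4", "years", "Cxn=num-mod|CxnEltOf=4:num-mod.Counted,5:predicative-age.Units"),
    ("5", "old", "Cxn=predicative-age,property-predication|CxnEltOf=5:property-predication.Pred"),
]


def parse_misc(misc):
    """Split a MISC string such as 'A=1|B=2' into a dict; '_' means empty."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


def cxn_edges(rows):
    """Yield (evoker_id, cxn_name, element_name, token_id) for every CxnEltOf entry."""
    for tok_id, _form, misc in rows:
        value = parse_misc(misc).get("CxnEltOf")
        if not value:
            continue
        for entry in value.split(","):
            head, label = entry.split(":", 1)        # '5', 'predicative-age.Individual'
            cxn, element = label.rsplit(".", 1)      # 'predicative-age', 'Individual'
            yield head, cxn, element, tok_id


def elements_of(rows, evoker, cxn):
    """Collect the tokens filling each element of one evoked construction."""
    found = defaultdict(list)
    for head, name, element, tok_id in cxn_edges(rows):
        if head == evoker and name == cxn:
            found[element].append(tok_id)
    return dict(found)
```

For instance, `elements_of(ROWS, "5", "predicative-age")` gathers the Individual, Value and Units tokens of the construction evoked by "old".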

Note that i) a word may evoke multiple constructions, ii) a word may be both the evoker and an element of an evoked construction, iii) a word may participate in multiple elements of the same evoked construction.

Comma-separated lists should be sorted primarily by head node (where present), secondarily by construction name, thirdly by construction element name.
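The sort order can be sketched as a three-part key. This is only one possible reading: the helper names are hypothetical, heads are compared numerically, and placing entries without a head before all others is our assumption, since the proposal does not say where they go.

```python
def sort_key(entry):
    """Key for one list entry: (head node, construction name, element name)."""
    head, sep, label = entry.partition(":")
    if not sep:
        # No head present, e.g. a bare name in a Cxn= list.
        head, label = "", entry
    cxn, _, element = label.partition(".")
    # Assumption: entries without a head sort before entries with one.
    head_num = int(head) if head else -1
    return (head_num, cxn, element)


def sort_entries(value):
    """Return a comma-separated list in the proposed order."""
    return ",".join(sorted(value.split(","), key=sort_key))
```

Comparing heads as integers keeps node 9 ahead of node 10, which plain string sorting would not.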

Full-consolidated

1	Sam	_
2	is	_
3	three	_
4	years	Cxn=num-mod(3:Quantity,4:Counted)
5	old	Cxn=predicative-age(1:Individual,3:Value,4:Units),property-predication(1:Subj,2:Cop,3-5:Pred)

This is equivalent to the Full representation but consolidates all parts of an evoked construction on one line. It might be suitable for human annotation, to be automatically expanded to the Full representation with a script.

Comma-separated construction elements should be listed in node sort order. Constructions should be sorted alphabetically by name.
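A possible shape for the expansion script, as a minimal sketch rather than a reference implementation. The names `CXN_RE` and `expand` are hypothetical. One assumption needs flagging: a span such as `3-5:Pred` is expanded here so that every token in the span receives the element, whereas the Full example above lists Pred on token 5 only. How spans should map to Full entries is still open (see the TBD issues).

```python
import re
from collections import defaultdict

# Matches one 'name(node:Elt,node:Elt,...)' group inside a consolidated Cxn value.
CXN_RE = re.compile(r"([^,()]+)\(([^()]*)\)")


def expand(consolidated):
    """Map {evoker_id: consolidated Cxn value} to Full-style values.

    Returns (cxn, elt_of): cxn maps each evoker to its Cxn= value and
    elt_of maps each token to its CxnEltOf= value.
    """
    cxn = {}
    elt_of = defaultdict(list)
    for evoker, value in consolidated.items():
        names = []
        for name, elements in CXN_RE.findall(value):
            names.append(name)
            for item in filter(None, elements.split(",")):
                nodes, element = item.split(":", 1)
                first, _, last = nodes.partition("-")
                # Assumption: a span assigns the element to each token it covers.
                for tok in range(int(first), int(last or first) + 1):
                    elt_of[tok].append((evoker, name, element))
        cxn[evoker] = ",".join(sorted(names))
    # Sort by head node, then construction name, then element name.
    full_elt_of = {
        tok: ",".join(f"{head}:{name}.{elt}" for head, name, elt in sorted(entries))
        for tok, entries in elt_of.items()
    }
    return cxn, full_elt_of
```

Called on `{4: "num-mod(3:Quantity,4:Counted)", 5: "predicative-age(1:Individual,3:Value,4:Units)"}`, this yields the same `CxnEltOf` values for those two constructions as the Full example.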

Simple

A partial representation may be useful in certain stages of an annotation workflow, e.g. before the full description of the construction is known, or before applying semiautomatic methods to identify construction elements.

The Simple notation includes the name of a construction, omitting any construction elements. A span may optionally be included for rendering purposes, but this span does not necessarily have any theoretical status.

1	Sam	_
2	is	_
3	three	_
4	years	Cxn=3-4:num-mod
5	old	Cxn=1-5:predicative-age,property-predication
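To make the Simple notation concrete, here is a small hedged Python sketch that reads a `Cxn=` value into (name, span) pairs. `ENTRY_RE` and `parse_simple` are hypothetical names. The sketch treats the optional span as belonging to the single entry it prefixes, so in `1-5:predicative-age,property-predication` the second construction gets no span. Whether a span is meant to be shared across entries is not stated in the proposal, so this is an assumption.

```python
import re

# Optional 'start-end:' (or single 'node:') prefix, followed by the construction name.
ENTRY_RE = re.compile(r"^(?:(\d+)(?:-(\d+))?:)?(.+)$")


def parse_simple(value):
    """Parse a Simple Cxn= value into a list of (name, span-or-None) pairs."""
    result = []
    for entry in value.split(","):
        start, end, name = ENTRY_RE.match(entry).groups()
        # A single node 'n:' is treated as the one-token span (n, n).
        span = (int(start), int(end or start)) if start else None
        result.append((name, span))
    return result
```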

Exclusions

When manually reviewing forms that are candidate matches of a construction, it may be helpful to indicate that one of them is a non-match (a false positive). This can be done with the ExcludeCxn feature:

1	Sam	_
2	is	_
3	three	_
4	years	Cxn=3-4:num-mod
5	old	Cxn=1-5:predicative-age,property-predication|ExcludeCxn=object-predication

Though we suggest the name ExcludeCxn in this standard, it should be regarded as a tool for development. Ideally, a corpus will be systematically reviewed for candidates of a construction, and excluded candidates discarded in the final version of the data.
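Discarding excluded candidates for a final release could be as simple as dropping the feature from MISC. A minimal sketch, with a hypothetical function name and assuming MISC items are `|`-separated `Key=Value` pairs:

```python
def strip_exclusions(misc):
    """Remove any ExcludeCxn=... item from a MISC string, keeping the rest."""
    if misc == "_":
        return misc
    kept = [item for item in misc.split("|") if not item.startswith("ExcludeCxn=")]
    # An empty MISC column is written as '_' in CoNLL-U.
    return "|".join(kept) or "_"
```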

Linking to a constructicon

If a constructicon resource exists, it should be declared in a metadata line in the file, and names of constructions from the resource should be prefixed with a namespace.

TBD issues

  • Where are spans vs. heads used? Is a construction-evoking element allowed to be a span? Allow discontinuous spans (and change existing commas to semicolons)?
  • A status field to indicate auto rather than gold matches?
  • Allow question marks to indicate uncertainty during development?

Advanced example (consolidated notation), showing two candidate matches of the same construction type on the same construction evoker, one of which is correct and one of which is incorrect (indicated by an excluded span):

1	Sam	_
2	is	_
3	so	_
4	glad	Cxn=causal-excess(1:Predicand,3:Degree,3-8:Cause,9-13:Result)|ExcludeCxn=causal-excess(5-8:Result)
5	that	_
6	you	_
7	are	_
8	here	_
9	that	_
10	he	_
11	baked	_
12	a	_
13	cake	_

Or in the simple form:

1	Sam	_
2	is	_
3	so	_
4	glad	Cxn=1-13:causal-excess|ExcludeCxn=1-8:causal-excess
5	that	_
6	you	_
7	are	_
8	here	_
9	that	_
10	he	_
11	baked	_
12	a	_
13	cake	_

Thanks for writing this up! I took a first stab at automatically annotating some English constructions from the linked sources (constructicon.de and Cxn Viewer) using DepEdit scripts, as well as adding UniMorph segmentations. The result looks like this:

[image: screenshot of the resulting annotation]

You can also see an example document in this gist:

https://gist.github.com/amir-zeldes/5c6720415786b98458ea4b64cb0eaef0

Looking forward to discussing more!

The specification doc will be posted in the repo soon.