korpling/paula-xml

XML element and attribute names

gcelano opened this issue · 3 comments

To make PAULA XML files lighter, would it be possible to shorten the names of repeating XML elements and attributes to a single character (e.g., instead of )?

Moreover, I see that in tokenization files, the actual tokens are often added as XML comments to help inspection: using an attribute within each element to this end would make XML parsing much more efficient.
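A rough Python sketch of the difference, under stated assumptions: the `mark` elements are simplified (real PAULA token files also carry `xlink:href` pointers and a DTD reference), and the `text` attribute is hypothetical, not part of the PAULA specification. It shows that a standard parser has to be told explicitly to keep comments, whereas an attribute is available directly.

```python
import xml.etree.ElementTree as ET

# Simplified token markup with the token text kept in XML comments.
WITH_COMMENTS = """<markList type="tok">
  <mark id="tok_1"/><!-- This -->
  <mark id="tok_2"/><!-- is -->
</markList>"""

# Same tokens, using a hypothetical "text" attribute (not defined by PAULA).
WITH_ATTRIBUTES = """<markList type="tok">
  <mark id="tok_1" text="This"/>
  <mark id="tok_2" text="is"/>
</markList>"""


def tokens_from_comments(xml_string):
    # ElementTree drops comments by default; insert_comments=True (Python 3.8+)
    # keeps them as child nodes of the element that contains them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_string, parser=parser)
    tokens = {}
    last_id = None
    for node in root:
        if node.tag is ET.Comment:
            # Pair each comment with the mark element that precedes it.
            if last_id is not None:
                tokens[last_id] = node.text.strip()
        else:
            last_id = node.get("id")
    return tokens


def tokens_from_attributes(xml_string):
    # The attribute is read directly from each mark element.
    root = ET.fromstring(xml_string)
    return {mark.get("id"): mark.get("text") for mark in root.iter("mark")}
```

The comment-based reader depends on each comment directly following its element, which the attribute-based reader does not need to assume.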

PAULA XML is already in use in some projects, so changing the format is probably not realistic, since it would break compatibility for all sorts of users.

I also think it's really not a format that promotes a concise representation; it's very wasteful in many ways. The main use case for it is when you want to have annotations in separate files, so that you can easily include or exclude various annotations via file filtering. If you want a more compact format, then depending on what you are doing I would recommend CoNLL-U, or maybe even a combination of formats.

But maybe @thomaskrause sees it differently?

@amir-zeldes, do you mean CoNLL-U Plus? Is there an example of CoNLL-U (Plus) standoff annotation files (starting from an untokenized text)?

I think that the strength of PAULA XML is in its clear formalization, even if, as I was suggesting, it could be made lighter.

I meant just plain CoNLL-U, without standoff. Using the MISC column you can cram a surprising amount of information in there, as we do in the GUM corpus. Here is an example.
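For illustration only, and separate from the GUM example referred to in the comment: a minimal Python sketch of how extra annotations packed into MISC can be read back. It relies on the CoNLL-U convention that MISC is the tenth tab-separated column and holds `|`-separated `Key=Value` items, with `_` for an empty column. The token line is made up, and `Note` is a hypothetical key, not one taken from GUM.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC value into a dict; "_" means the column is empty."""
    if misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        # partition keeps everything after the first "=" as the value.
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


# Made-up token line with the ten tab-separated CoNLL-U columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# "Note" is a hypothetical key used only for this sketch.
line = "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No|Note=greeting"
columns = line.split("\t")
misc = parse_misc(columns[9])
```

Any number of further `Key=Value` items can be appended to the same column, which is what makes it usable as a container for additional annotation layers.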