AjaxMultiCommentary/AjMC-NE-corpus

comply with final HIPE data format

mromanello opened this issue · 2 comments

Changes to implement:

  • naming of files (e.g. HIPE-2022-v1.0-ajmc-train-de.tsv)
  • move the dataset's version number to the document metadata, and remove from file name
  • add namespaces to document metadata (TBC)
  • change EndOfLine to EndOfSentence (because that's what it is)
  • add language metadata
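
For the EndOfLine to EndOfSentence change, a minimal sketch of a per-line rewrite could look like the following. It assumes the flag sits in the last tab-separated column of each token line, with multiple flags joined by `|`, and that metadata lines start with `#`; the function name `rename_flag` is hypothetical and not part of any existing script in this repo.

```python
def rename_flag(line: str, old: str = "EndOfLine", new: str = "EndOfSentence") -> str:
    """Rename a flag in the last tab-separated column of one TSV line."""
    # Leave blank lines and comment/metadata lines untouched.
    if not line.strip() or line.startswith("#"):
        return line
    columns = line.rstrip("\n").split("\t")
    # Assumption: flags live in the last column, separated by "|".
    flags = columns[-1].split("|")
    columns[-1] = "|".join(new if flag == old else flag for flag in flags)
    # Preserve the trailing newline if the input line had one.
    return "\t".join(columns) + ("\n" if line.endswith("\n") else "")
```

Matching whole flags rather than doing a plain string replace avoids touching tokens or metadata values that happen to contain the same text.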

more metadata fields to add:

  • hipe2022:applicable_columns
  • ajmc:license

W.r.t. license: go for CC-BY or CC-BY-NC (tbd).
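
A rough sketch of how the namespaced metadata could be serialized, assuming each field is written as a `# key = value` comment line at the top of a document. The helper `metadata_lines`, the `hipe2022:language` key, and all values below are placeholders for illustration; only `hipe2022:applicable_columns` and `ajmc:license` are named in this issue.

```python
def metadata_lines(fields: dict[str, str]) -> list[str]:
    """Format namespaced metadata fields as '# key = value' comment lines."""
    lines = []
    for key, value in fields.items():
        # Assumption: every key carries a namespace prefix such as "hipe2022:" or "ajmc:".
        if ":" not in key:
            raise ValueError(f"metadata key without namespace: {key!r}")
        lines.append(f"# {key} = {value}")
    return lines


# Placeholder values only; the license is still to be decided.
example = metadata_lines({
    "hipe2022:language": "de",
    "hipe2022:applicable_columns": "TBD",
    "ajmc:license": "TBD",
})
```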