FairyMaCorpus

Fairy Morphological Annotated Corpus

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

This corpus provides partial morphological annotations for text from Japanese Wikipedia. It is intended mainly for checking the errors of morphological analyzers rather than for training them. The data in this repository is a sample.

また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。
(Furthermore, it is 400 times as heavy as the black hole at Sagittarius A* in our galaxy.)

| indicates a word boundary. Each ? between the first and the last | indicates a candidate word boundary.
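
The notation above can be decoded mechanically. The following is a minimal sketch in Python, not code taken from this repository. The function name parse_annotation and the offset convention (offset i means "between character i-1 and character i of the raw text") are assumptions made for illustration. It also assumes that the text contains no literal | characters and that ? acts as a marker only between the first and the last |.

```python
def parse_annotation(annotated):
    """Split an annotated string into raw text and boundary offsets.

    Returns (raw, boundaries, candidates), where both lists hold
    character offsets into the raw (marker-free) text.
    """
    first = annotated.find("|")
    last = annotated.rfind("|")
    raw = []
    boundaries = []   # offsets marked with |
    candidates = []   # offsets marked with ? inside the annotated span
    for i, ch in enumerate(annotated):
        if ch == "|":
            boundaries.append(len(raw))
        elif ch == "?" and first < i < last:
            candidates.append(len(raw))
        else:
            # Ordinary character (including any ? outside the span).
            raw.append(ch)
    return "".join(raw), boundaries, candidates


example = "また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。"
raw, boundaries, candidates = parse_annotation(example)
print(raw)         # また、銀河系にあるいて座A*のブラックホールの400倍も重い。
print(boundaries)  # [9, 14]
print(candidates)  # [11, 12, 13]
```

Here the two | markers enclose いて座A* (offsets 9 to 14), and the three ? markers fall after いて, 座, and A.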

The annotation places a word boundary between ある and いて, but some morphological analyzers wrongly segment this part as あるい|て (あるい ("walk") and て ("and")). The corpus makes such errors visible. All annotations are based on the JUMAN part-of-speech system, which is an extension of the Masuoka and Takubo grammar.
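
An error of this kind can be detected by comparing an analyzer's segmentation with the annotated offsets. The sketch below is illustrative only and is not part of this repository. The function find_conflicts and the hand-written token lists are hypothetical, and the offsets (boundaries at 9 and 14, candidates at 11, 12, and 13) are character positions in the marker-free example sentence. It assumes that, inside the annotated span, any position carrying neither | nor ? must not be a word boundary. That reading is an interpretation of the notation, not something the corpus documentation states.

```python
def find_conflicts(tokens, boundaries, candidates):
    """Return (offset, reason) pairs where tokens contradict the annotation."""
    if not boundaries:
        return []
    # Offsets at which the segmentation places a word boundary.
    predicted = {0}
    offset = 0
    for token in tokens:
        offset += len(token)
        predicted.add(offset)

    allowed = set(boundaries) | set(candidates)
    conflicts = []
    # Only the span between the first and last | is annotated.
    for pos in range(min(boundaries), max(boundaries) + 1):
        if pos in boundaries and pos not in predicted:
            conflicts.append((pos, "missing boundary"))
        elif pos in predicted and pos not in allowed:
            conflicts.append((pos, "unexpected boundary"))
    return conflicts


# Hand-written segmentations of the start of the example sentence
# (illustrative, not real analyzer output).
wrong = ["また", "、", "銀河系", "に", "あるい", "て", "座", "A", "*", "の"]
good = ["また", "、", "銀河系", "に", "ある", "いて座A*", "の"]

print(find_conflicts(wrong, [9, 14], [11, 12, 13]))
# [(9, 'missing boundary'), (10, 'unexpected boundary')]
print(find_conflicts(good, [9, 14], [11, 12, 13]))
# []
```

The segmentation あるい|て is flagged twice: it lacks the required boundary after ある and adds a boundary inside いて.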

Files

  • corpus
    • The first column of each .tsv file contains the annotated text.
    • Other columns contain additional information.
  • scripts
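
Given the layout described above, the annotated texts can be pulled out of a .tsv file by taking the first tab-separated field of each line. This is a minimal sketch under stated assumptions: the function name iter_annotated_texts is hypothetical, the files are assumed to be UTF-8 without a header row, and the second column in the sample line is invented for the demonstration.

```python
from typing import Iterable, Iterator


def iter_annotated_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first tab-separated column of each non-empty line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split only once; the remaining columns are ignored here.
        yield line.split("\t", 1)[0]


# In-memory stand-in for a .tsv file; the second column is made up.
sample = ["また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。\textra\n"]
texts = list(iter_annotated_texts(sample))
print(texts[0])
```

With a real file, an open file handle can be passed in place of the list, for example one obtained from open(path, encoding="utf-8"), where path points to a file under corpus.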

References

@INPROCEEDINGS{hayashibe:2017:SIGNL231,
    author    = {林部祐太},
    title     = {日本語部分形態素アノテーションコーパスの構築},
    booktitle = "情報処理学会第231回自然言語処理研究会",
    year      = "2017",
    pages     = "NL-231-9:1-8",
    publisher = "情報処理学会",
}

License