Keyaki Treebank (Version 1.1)
=============================

Thanks for your interest in this release of the Keyaki Treebank.

The Keyaki Treebank is a parsed corpus that aims to instantiate a coherent
descriptive grammar of the Japanese language, allowing searches for
a wide variety of grammatical phenomena.  More information about the
corpus can be found on the project website:

http://www.compling.jp/keyaki/


Contents of this File
=====================

(A) Corpus Format
(B) Recovering the full data
(C) Citation
(D) Contributors
(E) Licence


(A) Corpus Format
=================

Keyaki Treebank files are formatted to be compatible with CorpusSearch.
See http://corpussearch.sourceforge.net/CS-manual/YourCorpus.html for
details.


(B) Recovering the full data
============================

The 'closed' folder contains annotations for data sourced from the
purchasable resources listed below. The words themselves have been
stripped out for licensing reasons, so you need these resources to
reinstate them:

- Mainichi Shinbun 1995 CD-ROM data collection (the same data set
  as used by the Kyoto Text Corpus). Available from Nichigai Associates:
  http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html

- Corpus of Spontaneous Japanese (CSJ) DVD-ROM data
  Available from http://pj.ninjal.ac.jp/corpus_center/csj/en/

- Balanced Corpus of Contemporary Written Japanese (BCCWJ) DVD edition
  Available from http://pj.ninjal.ac.jp/corpus_center/bccwj/

- Simultaneous Interpretation Database (SIDB)
  Available from http://sidb.jp/

The full corpus is obtained by running the following commands:

 ./scripts/collect_MAI_data -d MAINICHI_DIR
 ./scripts/integrate_MAI_characters
 ./scripts/integrate_CSJ_characters --source CSJ_DIR
 ./scripts/integrate_BCCWJ_characters --source BCCWJ_DIR
 ./scripts/integrate_SIDB_characters --source SIDB_DIR

where:

 'MAINICHI_DIR' is the directory containing the Mainichi Shinbun 1995 files.
 'CSJ_DIR' is the directory containing the CSJ DVD-ROM files.
 'BCCWJ_DIR' is the directory containing the BCCWJ DVD-ROM files.
 'SIDB_DIR' is the directory containing the SIDB files.
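
The command sequence can also be wrapped in a small shell function so that
a failed step stops the run. This is only a sketch: the function name
`recover_full_data` and the `RUN` variable are hypothetical helpers, not
part of the distributed scripts, and the function assumes it is called from
the directory that contains `scripts/`, with the steps run in the order
listed above.

```shell
#!/bin/sh
# Sketch only: recover_full_data and RUN are hypothetical helpers,
# not part of the Keyaki Treebank scripts.
# Assumes the current directory is the one containing ./scripts.

recover_full_data() {
    mai_dir=$1
    csj_dir=$2
    bccwj_dir=$3
    sidb_dir=$4

    # Each step runs only if the previous one succeeded.
    # With RUN=echo the commands are printed rather than executed.
    ${RUN:-} ./scripts/collect_MAI_data -d "$mai_dir" &&
    ${RUN:-} ./scripts/integrate_MAI_characters &&
    ${RUN:-} ./scripts/integrate_CSJ_characters --source "$csj_dir" &&
    ${RUN:-} ./scripts/integrate_BCCWJ_characters --source "$bccwj_dir" &&
    ${RUN:-} ./scripts/integrate_SIDB_characters --source "$sidb_dir"
}

# Example (placeholder paths):
#   recover_full_data MAINICHI_DIR CSJ_DIR BCCWJ_DIR SIDB_DIR
```

Setting `RUN=echo` before calling the function prints the five commands
without executing them, which is a cheap way to check the paths first.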

In order to run these commands, you need Perl, Python, Gawk, Tregex
(https://nlp.stanford.edu/software/tregex.shtml) and munge-trees
(http://web.science.mq.edu.au/~mjohnson/Software.htm).

For Tregex to work, place stanford-tregex.jar in the same directory
as this README file.
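
Before starting, it may help to confirm that the prerequisites are in
place. The following is a minimal sketch, not part of the distribution:
`missing_tools` and `jar_in_place` are hypothetical helper names, and the
executable names passed to them are assumptions (your system may provide
`python3` rather than `python`, for instance). `java` is included on the
assumption that Tregex, being distributed as a jar, is run through a Java
runtime. munge-trees is left out because the name of its executable is not
stated here.

```shell
#!/bin/sh
# Sketch only: missing_tools and jar_in_place are hypothetical helpers.

missing_tools() {
    # Print the name of every argument that is not found on PATH.
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

jar_in_place() {
    # Succeed when stanford-tregex.jar exists in the given directory
    # (default: the current directory, assumed to hold this README).
    [ -f "${1:-.}/stanford-tregex.jar" ]
}

# Example (executable names are assumptions; adjust as needed):
#   missing_tools perl python gawk java
#   jar_in_place . || echo "stanford-tregex.jar not found next to the README"
```

An empty result from `missing_tools` means every named executable was
found on `PATH`.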

The program for extracting texts from Mainichi Shinbun 1995 comes from
the Kyoto Corpus project
(http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?Kyoto%20University%20Text%20Corpus).

(C) Citation
============

When presenting research results obtained from this treebank, please
include a citation in the following form:

Alastair Butler, Kei Yoshimoto, Shota Hiyama, Stephen Wright Horn,
Iku Nagasaki, and Ai Kubota. 2018. The Keyaki Treebank Parsed Corpus,
Version 1.1 (http://www.compling.jp/Keyaki/ accessed YYYY/MM/DD).


(D) Contributors
================

Alastair Butler (Hirosaki University)
Kei Yoshimoto (Tohoku University)
Shota Hiyama (Tohoku University)
Stephen Wright Horn (National Institute for Japanese Language and Linguistics)
Iku Nagasaki (National Institute for Japanese Language and Linguistics)
Ai Kubota (National Institute for Japanese Language and Linguistics)
Ken Kishiyama (University of Tokyo)
Makoto Orikasa (Sophia University)
Noritugu Hayashi (University of Tokyo)
Ruriko Otomo (University of Hong Kong)
Tomoya Kosuge (Tohoku University)
Shinya Okano (University of Tokyo)
Yumiko Kinjo (National Institute for Japanese Language and Linguistics)
Ryosuke Sato (Tohoku University)
Takumi Toda (Tohoku University)
Tsaiwei Fang (Tohoku University)
Yusuke Kubota (University of Tsukuba)
Yoshiko Ookubo (JSA)
Y. Ueda (JSA)
Natsuha Katakura (Tohoku University)


(E) Licence
===========

The corpus annotation (the grammatical analysis) is licensed under
the Creative Commons Attribution 4.0 International License. To view a copy
of this license, visit http://creativecommons.org/licenses/by/4.0/ or send
a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

The unannotated corpus material belongs to the authors and publishers
as detailed in the metadata.