An Open Source Japanese NLP Library, based on Universal Dependencies
Please read the Important changes before you upgrade GiNZA.
GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models are distributed under The MIT License. You must agree to and follow The MIT License to use GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models.
spaCy is the key framework of GiNZA. spaCy LICENSE PAGE
SudachiPy provides high accuracy for tokenization and POS tagging. Sudachi LICENSE PAGE, SudachiPy LICENSE PAGE
The parsing model of GiNZA v4 is trained on a part of UD Japanese BCCWJ v2.6 (Omura and Asahara:2018). This model is developed by National Institute for Japanese Language and Linguistics, and Megagon Labs.
The named entity recognition model of GiNZA v4 is trained on a part of GSK2014-A (2019) BCCWJ edition (Hashimoto, Inui, and Murakami:2008). We use two named entity label systems: Sekine's Extended Named Entity Hierarchy and an extended OntoNotes5. This model is developed by National Institute for Japanese Language and Linguistics, and Megagon Labs.
This project is developed with Python>=3.6 and its pip. We do not recommend using an Anaconda environment because the pip install step may not work properly. (We would like to support Anaconda in the near future.)
Please also see the Development Environment section below.
Run the following command:
$ pip install -U ginza
If you encounter install problems related to Cython, try setting CFLAGS as shown below.
$ CFLAGS='-stdlib=libc++' pip install ginza
Run the ginza command from the console, then enter some Japanese text.
After you press the Enter key, the parsed result is printed in CoNLL-U Syntactic Annotation format.
$ ginza
銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
6 一緒 一緒 VERB 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
7 し する AUX 動詞-非自立可能 _ 6 advcl _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。
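The CoNLL-U lines above can be read back into Python with the standard library alone. The following is a minimal sketch, not part of GiNZA: the helper names `parse_misc` and `parse_token_line` are hypothetical, and it assumes the ten standard CoNLL-U columns, with the last (MISC) column holding `|`-separated `key=value` pairs plus bare flags such as `NP_B`.

```python
# Column names defined by the CoNLL-U format (10 columns per token line).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_misc(misc):
    """Split the MISC column into a dict; bare flags such as NP_B map to True."""
    result = {}
    if misc == "_":
        return result
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_token_line(line):
    """Parse one token line (hypothetical helper, not a GiNZA API)."""
    # CoNLL-U is tab-separated; fall back to whitespace for copied samples.
    columns = line.split("\t") if "\t" in line else line.split()
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    token = dict(zip(CONLLU_FIELDS, columns))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    token["misc"] = parse_misc(token["misc"])
    return token


line = ("1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ "
        "SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|"
        "NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City")
token = parse_token_line(line)
print(token["form"], token["head"], token["misc"]["Reading"], token["misc"]["ENE"])
```

The whitespace fallback is only for samples copied from this page; for real output, splitting on tabs is the safer choice.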
The ginzame command provides a MeCab-like tokenization function.
The output format of ginzame is almost the same as that of mecab, but the last field (pronunciation) is always '*'.
$ ginzame
銀座でランチをご一緒しましょう。
銀座 名詞,固有名詞,地名,一般,*,*,銀座,ギンザ,*
で 助詞,格助詞,*,*,*,*,で,デ,*
ランチ 名詞,普通名詞,一般,*,*,*,ランチ,ランチ,*
を 助詞,格助詞,*,*,*,*,を,ヲ,*
ご 接頭辞,*,*,*,*,*,御,ゴ,*
一緒 名詞,普通名詞,サ変可能,*,*,*,一緒,イッショ,*
し 動詞,非自立可能,*,*,サ行変格,連用形-一般,為る,シ,*
ましょう 助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*
。 補助記号,句点,*,*,*,*,。,。,*
EOS
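As a rough illustration of how this mecab-like output could be consumed, the sketch below splits each line into a surface form and a list of feature fields and groups tokens by the EOS marker. The function name `parse_mecab_like` is hypothetical, and the code assumes the surface and the features are separated by a tab (or by a single space, as in the sample shown here). It does not handle feature values that themselves contain commas.

```python
def parse_mecab_like(lines):
    """Yield one list of (surface, features) pairs per sentence."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "EOS":
            # EOS closes the current sentence.
            yield sentence
            sentence = []
            continue
        if not line:
            continue
        # Assumption: a tab separates surface and features; fall back to a space.
        separator = "\t" if "\t" in line else " "
        surface, _, feature_text = line.partition(separator)
        sentence.append((surface, feature_text.split(",")))


sample = """銀座 名詞,固有名詞,地名,一般,*,*,銀座,ギンザ,*
で 助詞,格助詞,*,*,*,*,で,デ,*
EOS
"""
sentences = list(parse_mecab_like(sample.splitlines()))
for surface, features in sentences[0]:
    # features[0] is the top-level POS, features[7] is the reading.
    print(surface, features[0], features[7])
```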
The ginza command can also output spaCy's JSON format when you specify -f 3 or -f json.
$ ginza -f json
銀座でランチをご一緒しましょう。
[
{
"paragraphs": [
{
"raw": "銀座でランチをご一緒しましょう。",
"sentences": [
{
"tokens": [
{"id": 1, "orth": "銀座", "tag": "名詞-固有名詞-地名-一般", "pos": "PROPN", "lemma": "銀座", "head": 5, "dep": "obl", "ner": "B-City"},
{"id": 2, "orth": "で", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "で", "head": -1, "dep": "case", "ner": "O"},
{"id": 3, "orth": "ランチ", "tag": "名詞-普通名詞-一般", "pos": "NOUN", "lemma": "ランチ", "head": 3, "dep": "obj", "ner": "O"},
{"id": 4, "orth": "を", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "を", "head": -1, "dep": "case", "ner": "O"},
{"id": 5, "orth": "ご", "tag": "接頭辞", "pos": "NOUN", "lemma": "ご", "head": 1, "dep": "compound", "ner": "O"},
{"id": 6, "orth": "一緒", "tag": "名詞-普通名詞-サ変可能", "pos": "VERB", "lemma": "一緒", "head": 0, "dep": "ROOT", "ner": "O"},
{"id": 7, "orth": "し", "tag": "動詞-非自立可能", "pos": "AUX", "lemma": "する", "head": -1, "dep": "advcl", "ner": "O"},
{"id": 8, "orth": "ましょう", "tag": "助動詞", "pos": "AUX", "lemma": "ます", "head": -2, "dep": "aux", "ner": "O"},
{"id": 9, "orth": "。", "tag": "補助記号-句点", "pos": "PUNCT", "lemma": "。", "head": -3, "dep": "punct", "ner": "O"}
]
}
]
}
]
}
]
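Comparing this JSON with the CoNLL-U output for the same sentence suggests that each "head" value is an offset relative to the token's own "id", with 0 marking the root. The sketch below converts the offsets back to absolute head ids under that assumption. The function name `absolute_heads` is hypothetical, and only the fields needed for the conversion are copied from the sample.

```python
# (id, orth, head) triples copied from the JSON sample above.
tokens = [
    {"id": 1, "orth": "銀座", "head": 5},
    {"id": 2, "orth": "で", "head": -1},
    {"id": 3, "orth": "ランチ", "head": 3},
    {"id": 4, "orth": "を", "head": -1},
    {"id": 5, "orth": "ご", "head": 1},
    {"id": 6, "orth": "一緒", "head": 0},
    {"id": 7, "orth": "し", "head": -1},
    {"id": 8, "orth": "ましょう", "head": -2},
    {"id": 9, "orth": "。", "head": -3},
]


def absolute_heads(tokens):
    """Map each token id to its head id, using 0 for the root.

    Assumes "head" is a relative offset and that an offset of 0 marks the root.
    """
    heads = {}
    for tok in tokens:
        heads[tok["id"]] = 0 if tok["head"] == 0 else tok["id"] + tok["head"]
    return heads


heads = absolute_heads(tokens)
for tok in tokens:
    print(tok["id"], tok["orth"], heads[tok["id"]])
```

The resulting head ids match the HEAD column of the CoNLL-U output shown earlier for this sentence.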
If you want output like that of cabocha -f1 (lattice style), add the -f 1 or -f cabocha option to the ginza command.
This format is almost the same as cabocha -f1, but the func_index field (after the slash) is slightly different.
Our func_index field indicates the boundary where the 自立語 (content words) end in each 文節 (bunsetu), and where the 機能語 (function words) might start.
The functional token filter also differs slightly between cabocha -f1 and ginza -f cabocha.
$ ginza -f cabocha
銀座でランチをご一緒しましょう。
* 0 2D 0/1 0.000000
銀座 名詞,固有名詞,地名,一般,,銀座,ギンザ,* B-City
で 助詞,格助詞,*,*,,で,デ,* O
* 1 2D 0/1 0.000000
ランチ 名詞,普通名詞,一般,*,,ランチ,ランチ,* O
を 助詞,格助詞,*,*,,を,ヲ,* O
* 2 -1D 0/2 0.000000
ご 接頭辞,*,*,*,,ご,ゴ,* O
一緒 名詞,普通名詞,サ変可能,*,,一緒,イッショ,* O
し 動詞,非自立可能,*,*,サ行変格,連用形-一般,する,シ,* O
ましょう 助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,* O
。 補助記号,句点,*,*,,。,。,* O
EOS
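To make the role of func_index concrete, the sketch below parses a chunk header line such as `* 2 -1D 0/2 0.000000` and uses func_index to split the tokens of a bunsetu into a content part and a remaining part. The names `ChunkHeader`, `parse_chunk_header`, and `head_index` are hypothetical. Calling the number before the slash `head_index` follows CaboCha's convention and is an assumption here.

```python
import re
from typing import NamedTuple


class ChunkHeader(NamedTuple):
    index: int       # bunsetu index, starting from 0
    link: int        # index of the bunsetu this one depends on, -1 for the root
    head_index: int  # number before the slash (name assumed from CaboCha)
    func_index: int  # number after the slash
    score: float


HEADER_RE = re.compile(r"^\* (\d+) (-?\d+)D (\d+)/(\d+) (-?[\d.]+)$")


def parse_chunk_header(line):
    """Parse a '* index linkD head/func score' header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a chunk header: {line!r}")
    index, link, head, func, score = match.groups()
    return ChunkHeader(int(index), int(link), int(head), int(func), float(score))


header = parse_chunk_header("* 2 -1D 0/2 0.000000")
surfaces = ["ご", "一緒", "し", "ましょう", "。"]

# func_index marks where the content words end within the bunsetu.
content_part = surfaces[:header.func_index]
remaining_part = surfaces[header.func_index:]
print(header)
print(content_part, remaining_part)
```

For the third bunsetu of the sample, this puts ご and 一緒 in the content part and し, ましょう, and 。 in the remainder.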
The -p NUM_PROCESS option is available from GiNZA v3.0.
Specify the number of analysis processes as NUM_PROCESS.
To use all the CPU cores for GiNZA, run ginza -p 0.
The memory requirement is about 130MB per process (to be improved).
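A back-of-the-envelope estimate of total memory use can be derived from the per-process figure. This is only a sketch: the helper `estimated_memory_mb` is hypothetical, 130MB is the approximate figure quoted above, and treating `-p 0` as "one process per core reported by `os.cpu_count()`" is an assumption about how the option behaves.

```python
import os

MEMORY_PER_PROCESS_MB = 130  # approximate per-process figure quoted above


def estimated_memory_mb(num_process):
    """Rough total memory estimate in MB for a given -p value.

    Assumption: 0 means one process per CPU core.
    """
    if num_process < 0:
        raise ValueError("num_process must be 0 or greater")
    workers = num_process if num_process > 0 else (os.cpu_count() or 1)
    return workers * MEMORY_PER_PROCESS_MB


print(estimated_memory_mb(4))  # 4 processes * 130MB
print(estimated_memory_mb(0))  # depends on the number of cores on this machine
```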
The following steps show dependency parsing results, with 'EOS' marking each sentence boundary.
import spacy
nlp = spacy.load('ja_ginza')
doc = nlp('銀座でランチをご一緒しましょう。')
for sent in doc.sents:
    for token in sent:
        print(token.i, token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.i)
    print('EOS')
Please see the spaCy API documents for general analysis functions. Until we write dedicated documentation, please refer to the GiNZA source code on GitHub.
Set the user dictionary files in the userDict field of sudachi.json, which is in the installed package directory of the ja_ginza_dict package.
Please read the following official documents to compile user dictionaries with the sudachipy command.
SudachiPy - User defined Dictionary
Sudachi User Dictionary Construction (Japanese Only)
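Editing the userDict field by hand is straightforward, but it can also be scripted. The sketch below is a minimal illustration, not a GiNZA or SudachiPy API: the function `add_user_dictionary` and the file name `user.dic` are hypothetical, and it assumes that userDict holds a list of dictionary file paths. The demo runs against a temporary file rather than the real sudachi.json.

```python
import json
import tempfile
from pathlib import Path


def add_user_dictionary(config_path, dictionary_path):
    """Append dictionary_path to the userDict list of a sudachi.json-style file.

    Assumption: userDict is a JSON list of dictionary file paths.
    """
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    user_dicts = config.setdefault("userDict", [])
    if dictionary_path not in user_dicts:
        user_dicts.append(dictionary_path)
    config_file.write_text(
        json.dumps(config, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return config


# Demo on a throwaway copy; point config_path at the real sudachi.json to apply it.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "sudachi.json"
    demo_config.write_text('{"systemDict": "system.dic"}', encoding="utf-8")
    updated = add_user_dictionary(demo_config, "user.dic")  # hypothetical file name
    print(updated["userDict"])
```

Backing up the original sudachi.json before modifying it is advisable, since reinstalling the package may overwrite the file.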
- 2021-06-01
- Bug fix
  - Issue #160: IndexError: list assignment index out of range for empty string
- 2020-10-01
- Improvements
  - Add -d option, which disables spaCy's sentence separator, to ginza command line tool
- 2020-09-11
- Improvements
  - ginza command line tool works correctly without BunsetuRecognizer in the pipeline
- 2020-09-10
- Improve bunsetu head identification accuracy over inconsistent deps in ent spans
- 2020-09-04
- Improvements
  - Serialization of CompoundSplitter for nlp.to_disk()
  - Bunsetu span detection accuracy
- 2020-08-30
- Debug
  - Add type arguments for singledispatch register annotations (for Python 3.6)
- 2020-08-16, Chrysoberyl
- Important changes
  - Replace Japanese model with spacy.lang.ja of spaCy v2.3
    - Replace values of Token.lemma_ with the output of SudachiPy's Morpheme.dictionary_form()
  - Replace ja_ginza_dict with official SudachiDict-core package
    - You can safely delete the ja_ginza_dict package
  - Change options and misc field contents of output of command line tool
    - delete use_sentence_separator(-s)
    - NE(OntoNotes) BI labels as B-GPE
    - Add subfields: Reading, Inf(inflection) and ENE(Extended NE)
  - Obsolete Token._.* and add some entries for Doc.user_data[] and accessors
    - inflections (ginza.inflection(Token))
    - reading_forms (ginza.reading_form(Token))
    - bunsetu_bi_labels (ginza.bunsetu_bi_label(Token))
    - bunsetu_position_types (ginza.bunsetu_position_type(Token))
    - bunsetu_heads (ginza.is_bunsetu_head(Token))
  - Change pipeline architecture
    - JapaneseCorrector was obsoleted
    - Add CompoundSplitter and BunsetuRecognizer
  - Upgrade UD_JAPANESE-BCCWJ to v2.6
  - Change word2vec to chiVe mc90
- API Changes
  - Add bunsetu-unit APIs (from ginza import *)
    - bunsetu(Token)
    - phrase(Token)
    - sub_phrases(Token)
    - phrases(Span)
    - bunsetu_spans(Span)
    - bunsetu_phrase_spans(Span)
    - bunsetu_head_list(Span)
    - bunsetu_head_tokens(Span)
    - bunsetu_bi_labels(Span)
    - bunsetu_position_types(Span)
- 2020-02-12
- Debug
  - Fix: degradation of cabocha mode
- 2020-01-19
- API Changes
  - Extension fields
    - The values of the Token._.sudachi field are set only after calling SudachipyTokenizer.set_enable_ex_sudachi(True), to avoid serialization errors
import spacy
import pickle

nlp = spacy.load('ja_ginza')

# By default Token._.sudachi is not set, so the Doc can be serialized.
doc1 = nlp('This example will be serialized correctly.')
doc1.to_bytes()
with open('sample1.pickle', 'wb') as f:
    pickle.dump(doc1, f)

# After enabling the extension, Token._.sudachi is set and serialization fails.
nlp.tokenizer.set_enable_ex_sudachi(True)
doc2 = nlp('This example will cause a serialization error.')
doc2.to_bytes()
with open('sample2.pickle', 'wb') as f:
    pickle.dump(doc2, f)
- 2020-01-16
- Important changes
  - Distribute ja_ginza_dict from PyPI
- API Changes
  - commands
    - ginza and ginzame
      - add -i option to initialize the files of ja_ginza_dict
- 2020-01-15, Benitoite
- Important changes
  - Distribute ginza and ja_ginza from PyPI
    - Simple installation; pip install ginza, and run ginza
    - The model package, ja_ginza, is also available from PyPI.
  - Model improvements
    - Change NER training data-set to GSK2014-A (2019) BCCWJ edition
      - Improved accuracy of NER
      - token.ent_type_ value is changed to Sekine's Extended Named Entity Hierarchy
        - Add ENE7 attribute to the last field of the output of ginza
      - Move OntoNotes5-based label to token._.ne
        - We extended the OntoNotes5 named entity labels with PHONE, EMAIL, URL, and PET_NAME
    - Overall accuracy is improved by executing spacy pretrain over 100 epochs
      - Multi-task learning of spacy train effectively working on UD Japanese BCCWJ
    - The newest SudachiDict_core-20191224
  - ginzame
    - Execute sudachipy by multiprocessing.Pool and output results with mecab like format
    - Now sudachipy command requires additional SudachiDict package installation
- Breaking API Changes
  - commands
    - ginza (ginza.command_line.main_ginza)
      - change option mode to sudachipy_mode
      - drop options: disable_pipes and recreate_corrector
      - add options: hash_comment, parallel, files
      - add mecab to the choices for the argument of -f option
      - add parallel NUM_PROCESS option (EXPERIMENTAL)
      - add ENE7 attribute to conllu miscellaneous field
        - ginza.ent_type_mapping.ENE_NE_MAPPING is used to convert ENE7 label to NE
    - add ginzame (ginza.command_line.main_ginzame)
      - a multi-process tokenizer providing mecab like output format
  - spaCy field extensions
    - add token._.ne for ner label
  - ginza/sudachipy_tokenizer.py
    - change SudachiTokenizer to SudachipyTokenizer
    - use SUDACHI_DEFAULT_SPLIT_MODE instead of SUDACHI_DEFAULT_SPLITMODE or SUDACHI_DEFAULT_MODE
- Dependencies
  - upgrade spacy to v2.2.3
  - upgrade sudachipy to v0.4.2
- 2019-10-28
- Improvements
  - JapaneseCorrector can merge the as_* type dependencies completely
- Bug fixes
  - The command line tool failed in specific situations
- 2019-10-04, Ametrine
- Important changes
  - split_mode has been set incorrectly to sudachipy.tokenizer from v2.0.0 (#43)
    - This bug caused split_mode incompatibility between the training phase and the ginza command.
    - split_mode was set to 'B' for the training phase and Python APIs, but 'C' for the ginza command.
    - We fixed this bug by setting the default split_mode to 'C' entirely.
    - This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
- New features
  - Add -f and --output-format option to ginza command:
    - -f 0 or -f conllu : CoNLL-U Syntactic Annotation format
    - -f 1 or -f cabocha : cabocha -f1 compatible format
  - Add custom token fields:
    - bunsetu_index : bunsetu index starting from 0
    - reading : reading of token (not a pronunciation)
    - sudachi : SudachiPy's morpheme instance (or its list when the tokens are gathered by JapaneseCorrector)
- Performance improvements
  - Tokenizer
    - Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
    - Use Cythonized SudachiPy (v0.4.0)
  - Dependency parser
    - Apply spacy pretrain command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC.
    - Apply multitask objectives by using -pt 'tag,dep' option of spacy train
  - New model file
    - ja_ginza-2.2.0.tar.gz
- 2019-07-08
- Add ginza command
  - run ginza from the console
- Change package structure
  - module package as ginza
  - language model package as ja_ginza
  - spacy.lang.ja is overridden by ginza
- Remove sudachipy related directories
  - SudachiPy and its dictionary are installed via pip during ginza installation
- User dictionary available
- Token extension fields
  - Added token._.bunsetu_bi_label, token._.bunsetu_position_type
  - Retained token._.inf
  - Removed pos_detail (same value is set to token.tag_)
- 2019-04-07
- Set the head token index of the root to 0 to conform to the CoNLL-U format definition
- 2019-04-02
- Add new Japanese era 'reiwa' to system_core.dic.
- 2019-04-01
- First release version
$ git clone 'https://github.com/megagonlabs/ginza.git'
For normal environment:
$ python setup.py develop
Copy system.dic from the installed package directory of ja_ginza_dict to ./ja_ginza_dict/sudachidict/.
The analysis model of GiNZA is trained with the spacy train command.
$ python -m spacy train ja ja_ginza-4.0.0 corpus/ja_ginza-ud-train.json corpus/ja_ginza-ud-dev.json -b ja_vectors_chive_mc90_35k/ -ovl 0.3 -n 100 -m meta.json.ginza -V 4.0.0