atilika/kuromoji

Normalized surface in user dictionary.

mrikitoku opened this issue · 5 comments

In the current IPADIC implementation, there seems to be no way to normalize the surface form of a word in the user dictionary.
Is this right?

I think this functionality is very useful and is needed in many common situations.

So I plan to extend the user dictionary so that it can normalize a word's surface form while keeping the current user dictionary resource format.

What do you think about this?

cmoen commented

Thanks! Could you give an example of what kind of normalisations you'd like to see?

I'm wondering if we might already support it in the full/expanded user dictionary format in 1.0-SNAPSHOT.

As you know, the Token class has a getBaseForm method, which I regard as a kind of surface normalization.

With this method, we can get a normalized surface if a base form is registered for each morpheme, like this:

    public static void execute() {
        // Build a tokenizer in search mode with the default dictionary.
        Tokenizer.Builder builder = new Tokenizer.Builder();
        builder.mode(TokenizerBase.Mode.SEARCH);
        String text = "プログラミングの入門書を書いている.";
        Tokenizer tagger = builder.build();

        // Print the surface, base form, and all features of each token.
        List<Token> tokens = tagger.tokenize(text);
        for (Token t : tokens) {
            System.out.println("t.getSurface(): " + t.getSurface());
            System.out.println("t.getBaseForm(): " + t.getBaseForm());
            System.out.println(" " + t.getAllFeatures());
        }
    }
   ...
   >t.getSurface(): 書い
   >t.getBaseForm(): 書く

However, in the current IPADIC user dictionary implementation, there seems to be no way to register a base form for each user dictionary word. Instead of a base form, we can only register the reading and the segmented surface.

In the current UserDictionary implementation, SIMPLE_USERDICT_FIELDS is set to 4, as follows:

SIMPLE_USERDICT_FIELDS = 4;

The simple user dictionary fields are the following:

 String surface = values[0];
 String segmentationValue = values[1];
 String readingsValue = values[2];
 String partOfSpeech = values[3];
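
As a rough illustration of these four fields, a line in the simple format could be read as in the sketch below. SimpleUserEntry and its parse method are hypothetical names and not Kuromoji classes. The optional fifth value holding a base form, and the fallback to the surface when it is missing, are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one simple user dictionary line; not a Kuromoji class.
final class SimpleUserEntry {
    static final int SIMPLE_USERDICT_FIELDS = 4;

    final String surface;
    final List<String> segmentation;
    final List<String> readings;
    final String partOfSpeech;
    final String baseForm; // assumption: taken from an optional fifth field

    private SimpleUserEntry(String surface, List<String> segmentation,
                            List<String> readings, String partOfSpeech, String baseForm) {
        this.surface = surface;
        this.segmentation = segmentation;
        this.readings = readings;
        this.partOfSpeech = partOfSpeech;
        this.baseForm = baseForm;
    }

    static SimpleUserEntry parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] values = line.split(",", -1);
        if (values.length != SIMPLE_USERDICT_FIELDS
                && values.length != SIMPLE_USERDICT_FIELDS + 1) {
            throw new IllegalArgumentException("expected 4 or 5 fields: " + line);
        }
        String surface = values[0];
        // Segments and readings are separated by spaces inside their field.
        List<String> segmentation = Arrays.asList(values[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(values[2].trim().split("\\s+"));
        String partOfSpeech = values[3];
        // Assumption: without a fifth field, fall back to the surface itself.
        String baseForm = (values.length == SIMPLE_USERDICT_FIELDS + 1 && !values[4].isEmpty())
                ? values[4]
                : surface;
        return new SimpleUserEntry(surface, segmentation, readings, partOfSpeech, baseForm);
    }

    public static void main(String[] args) {
        SimpleUserEntry entry = parse("NAIST,NAIST,naisuto,meisi,奈良先端科学技術大学院大学");
        System.out.println(entry.surface + " -> " + entry.baseForm);
    }
}
```

Existing four-field lines still parse with this sketch, so the resource format would stay backward compatible under these assumptions.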

I think the base form is needed more often in typical use cases.
So I plan to add a fifth field for the base form.

Or

let segmentationValue carry the base form.

Must the length of segmentationValue without spaces equal the length of the surface, because term splitting is done using the offset and length of each split word?
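
To make the length question concrete, here is a minimal sketch of offset-based splitting. It assumes, as the question supposes, that each segment is located in the surface only by its offset and length. SegmentationOffsets is a hypothetical helper and not the actual Kuromoji code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only illustrates the supposed offset/length bookkeeping.
final class SegmentationOffsets {

    // True if the segments, with spaces removed, cover exactly surface.length() characters.
    static boolean sameLength(String surface, String segmentationValue) {
        return segmentationValue.replaceAll("\\s+", "").length() == surface.length();
    }

    // Returns one {offset, length} pair per space-separated segment.
    static List<int[]> offsets(String surface, String segmentationValue) {
        if (!sameLength(surface, segmentationValue)) {
            throw new IllegalArgumentException(
                    "segmentation does not cover the surface: " + segmentationValue);
        }
        List<int[]> result = new ArrayList<>();
        int offset = 0;
        for (String segment : segmentationValue.trim().split("\\s+")) {
            result.add(new int[] {offset, segment.length()});
            offset += segment.length();
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] pair : offsets("関西国際空港", "関西 国際 空港")) {
            System.out.println("offset=" + pair[0] + " length=" + pair[1]);
        }
        // The lengths differ here (5 versus 13 characters), so a strict check would reject it.
        System.out.println(sameLength("NAIST", "奈良先端科学技術大学院大学"));
    }
}
```

Under this assumption, putting a base form of a different length into segmentationValue would fail a strict length check, which is the concern behind the question.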

    @Test
    public void testBaseForm() throws IOException {
        // The second field (segmentationValue) carries the intended base form here.
        String userDictionary = "NAIST,奈良先端科学技術大学院大学,naisuto,meisi";
        Tokenizer tokenizer = makeTokenizer(userDictionary);
        String input = "大学の略称はNAIST(ナイスト)、奈良先端大学";

        List<Token> tokens = tokenizer.tokenize(input);
        for (Token t : tokens) {
            System.out.println("----");
            System.out.println(" surface:" + t.getSurface());
            System.out.println(" base form:" + t.getBaseForm());
            System.out.println(" reading:" + t.getReading());
            System.out.println(" class:" + t.getPartOfSpeechLevel1());
            System.out.println(" " + t.getAllFeatures());
        }
    }
----
 surface:大学
 base form:大学
 reading:ダイガク
 class:名詞
 名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
----
 surface:の
 base form:の
 reading:ノ
 class:助詞
 助詞,連体化,*,*,*,*,の,ノ,ノ
----
 surface:略称
 base form:略称
 reading:リャクショウ
 class:名詞
 名詞,サ変接続,*,*,*,*,略称,リャクショウ,リャクショー
----
 surface:は
 base form:は
 reading:ハ
 class:助詞
 助詞,係助詞,*,*,*,*,は,ハ,ワ
----
 surface:NAIST
 base form:奈良先端科学技術大学院大学
 reading:naisuto
 class:meisi
 meisi,*,*,*,*,*,奈良先端科学技術大学院大学,naisuto,*
----
 surface:(
 base form:(
 reading:(
 class:記号
 記号,括弧開,*,*,*,*,(,(,(
----
 surface:ナイスト
 base form:*
 reading:*
 class:名詞
 名詞,固有名詞,一般,*,*,*,*,*,*
----
 surface:)
 base form:)
 reading:)
 class:記号
 記号,括弧閉,*,*,*,*,),),)
----
 surface:、
 base form:、
 reading:、
 class:記号
 記号,読点,*,*,*,*,、,、,、
----
 surface:奈良先端大
 base form:奈良先端大
 reading:ナラセンタンダイ
 class:名詞
 名詞,固有名詞,組織,*,*,*,奈良先端大,ナラセンタンダイ,ナラセンタンダイ
----
 surface:学
 base form:学
 reading:ガク
 class:名詞
 名詞,接尾,一般,*,*,*,学,ガク,ガク