hive-udf-neologd

Hive Japanese NLP UDFs with NEologd

Primary language: Java. License: Apache License 2.0 (Apache-2.0).


This package extends Hivemall's Japanese NLP capability by utilizing NEologd.

Before getting started, build the latest version of hivemall-all-{HIVEMALL_VERSION}.jar as documented in the Hivemall installation guide.

Usage

Run the build script:

./build.sh

The build script is a modified version of kazuhira-r/kuromoji-with-mecab-neologd-buildscript.

Use the UDFs on Hive:

add jar hivemall-all-{HIVEMALL_VERSION}.jar; -- e.g., hivemall-all-0.5.1-incubating-SNAPSHOT.jar
add jar hive-udf-neologd-{VERSION}-{NEOLOGD_VERSION_DATE}.jar; -- e.g., hive-udf-neologd-0.1.0-20180524.jar
create temporary function tokenize_ja_neologd as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';
-- called with no arguments, the function returns its version string
select tokenize_ja_neologd();
-- ["{VERSION}-{NEOLOGD_VERSION_DATE}"]
select tokenize_ja_neologd('10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。');
-- ["10日","放送","中居正広の身になる図書館","テレビ朝日","系","smap","中居正広","篠原信一","過去","勘違い","明かす","一幕"]
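
The registration statements above differ only in the version strings embedded in the jar names, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of this repository: the function name `hive_neologd_setup` and its three positional arguments are hypothetical, and the version values in the example call are simply the ones quoted in the comments above. It assumes both jars sit in the directory from which Hive is started.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this package): prints the Hive
# statements that load both jars and register the tokenizer UDF.
# Arguments are assumptions; pass the versions of the jars you built.
hive_neologd_setup() {
  hivemall_version="$1"   # e.g. 0.5.1-incubating-SNAPSHOT
  udf_version="$2"        # e.g. 0.1.0
  neologd_date="$3"       # e.g. 20180524
  cat <<EOF
add jar hivemall-all-${hivemall_version}.jar;
add jar hive-udf-neologd-${udf_version}-${neologd_date}.jar;
create temporary function tokenize_ja_neologd as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';
EOF
}

# Example call: print the three statements to standard output.
hive_neologd_setup 0.5.1-incubating-SNAPSHOT 0.1.0 20180524
```

Redirecting the output to a file and passing that file to the Hive CLI with `hive -i` should run the statements at session start, so `tokenize_ja_neologd` is available without retyping them.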