NAME
Text::Shirasu - A wrapper around Text::MeCab and Text::CaboCha for natural language processing
SYNOPSIS
use utf8;
use feature ':5.10';
use Text::Shirasu;
my $ts = Text::Shirasu->new; # accepts the same parameters as Text::MeCab
my $normalize = $ts->normalize("昨日の晩御飯は「鮭のふりかけ」と「味噌汁」だけでした。");
$ts->parse($normalize);
for my $node (@{ $ts->nodes }) {
    say $node->surface;
}
say $ts->join_surface;
my $filter = $ts->filter(type => [qw/名詞 助動詞/], 記号 => [qw/括弧開 括弧閉/]);
say $filter->join_surface;
$ts->load_cabocha(); # loads a Text::CaboCha object; accepts the same parameters as Text::CaboCha
$ts->parse_cabocha("今日の晩御飯も「鮭のふりかけ」と「味噌汁」だけでした。");
for my $node (@{ $ts->cabocha_nodes }) {
    say $node->surface;
}
DESCRIPTION
Text::Shirasu is a wrapper around Text::MeCab.
This module makes it easy to normalize text and to filter words by part of speech.
METHODS
parse
This method wraps the parse method of Text::MeCab.
The analysis result is stored in the Text::Shirasu instance as Text::Shirasu::Node instances, and the method returns the Text::Shirasu instance itself.
$ts->parse("このおにぎりは「母」が握ってくれたものです。");
parse_cabocha
This method wraps the parse method of Text::CaboCha. The analysis result is stored in the Text::Shirasu instance as Text::Shirasu::CaboChaNode instances, and the method returns the Text::Shirasu instance itself.
$ts->parse_cabocha("このおにぎりも「母」が握ってくれたものです。");
normalize
Normalizes text using Lingua::JA::NormalizeText.
$ts->normalize("あ━ ”(*)” を〰〰 ’+1’")
$ts->normalize("テキスト〰〰", qw/nfkc alnum_z2h/, \&your_create_routine)
It takes the string to normalize as the first argument, followed by any number of Lingua::JA::NormalizeText option names and subroutine references. If you do not specify any options or subroutines, the following Lingua::JA::NormalizeText options and subroutines are used by default.
See the documentation of Lingua::JA::NormalizeText for details on how each option works.
Lingua::JA::NormalizeText options
nfkc nfkd nfc nfd alnum_z2h space_z2h katakana_h2z decode_entities unify_nl unify_whitespaces unify_long_spaces trim old2new_kana old2new_kanji tab2space all_dakuon_normalize square2katakana circled2kana circled2kanji decompose_parenthesized_kanji
Subroutines
normalize_hyphen normalize_symbols
filter
Call this method after the parse method.
It filters the surfaces based on the features of the nodes stored in the Text::Shirasu instance.
To narrow the result further, pass a part-of-speech name as a key with an array reference of its subtypes as the value.
$ts->filter(type => [qw/名詞/]);
$ts->filter(type => [qw/名詞 記号/], 記号 => [qw/括弧開 括弧閉/]);
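The effect of the type list and the per-part-of-speech subtype lists can be illustrated without MeCab installed. The following is only a sketch under assumptions: the hand-written node hashes, their feature layout (part of speech first, subtype second, as in IPAdic output) and the helper sketch_filter are hypothetical and are not part of Text::Shirasu, whose actual implementation may differ.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-ins for parsed nodes: each has a surface and a feature
# list whose first two fields are the part of speech and its subtype.
my @nodes = (
    { surface => '「', feature => [ '記号', '括弧開' ] },
    { surface => '鮭', feature => [ '名詞', '一般' ] },
    { surface => '」', feature => [ '記号', '括弧閉' ] },
    { surface => 'と', feature => [ '助詞', '並立助詞' ] },
    { surface => '。', feature => [ '記号', '句点' ] },
);

# sketch_filter is a hypothetical helper, not the module's own code.
# It keeps a node when its part of speech is listed in "type" and, if a
# subtype list was given for that part of speech, its subtype is listed too.
sub sketch_filter {
    my ($nodes, %args) = @_;
    my %wanted = map { $_ => 1 } @{ delete($args{type}) || [] };
    return [
        grep {
            my ($pos, $subtype) = @{ $_->{feature} };
            my $subtypes = $args{$pos};
            $wanted{$pos}
                && ( !$subtypes || grep { $_ eq $subtype } @$subtypes );
        } @$nodes
    ];
}

my $kept = sketch_filter(\@nodes, type => [qw/名詞 記号/], 記号 => [qw/括弧開 括弧閉/]);

binmode STDOUT, ':encoding(UTF-8)';
print join('', map { $_->{surface} } @$kept), "\n";    # prints 「鮭」
```

In this sketch the noun is kept because no subtype restriction was given for 名詞, while the full stop is dropped because 句点 is not among the subtypes listed for 記号.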
filter_cabocha
This method filters cabocha_nodes by part-of-speech tag, in the same way as the filter method.
$ts->filter_cabocha(type => [qw/名詞/]);
$ts->filter_cabocha(type => [qw/名詞 記号/], 記号 => [qw/括弧開 括弧閉/]);
join_surface
Returns a string that joins the surfaces of the nodes stored in the instance.
$ts->join_surface
join_surface_cabocha
Returns a string that joins the surfaces of the CaboCha nodes stored in the instance.
$ts->join_surface_cabocha
nodes
Returns an array reference of Text::Shirasu::Node instances.
$ts->nodes
cabocha_nodes
Returns an array reference of Text::Shirasu::CaboChaNode instances.
$ts->cabocha_nodes
mecab
Returns the Text::MeCab instance.
$ts->mecab
cabocha
Returns the Text::CaboCha instance.
$ts->cabocha
SUBROUTINES
These subroutines perform the following substitutions.
normalize_hyphen
s/[˗֊‐‑‒–⁃⁻₋−]/-/g;
s/[﹣－ｰ—―─━ー]/ー/g;
s/[~∼∾〜〰～]//g;
s/ー+/ー/g;
normalize_symbols
tr/｡､･｢｣/。、・「」/;
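To see what these substitutions do to a concrete string, here is a minimal standalone sketch. It assumes the character classes are meant to cover the full-width and half-width variants written out below, and the names sketch_normalize_hyphen and sketch_normalize_symbols are hypothetical stand-ins so that the example runs without Text::Shirasu installed.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for normalize_hyphen, for illustration only.
sub sketch_normalize_hyphen {
    my ($text) = @_;
    $text =~ s/[˗֊‐‑‒–⁃⁻₋−]/-/g;     # hyphen-like characters become ASCII "-"
    $text =~ s/[﹣－ｰ—―─━ー]/ー/g;    # dash-like characters become the long vowel mark
    $text =~ s/[~∼∾〜〰～]//g;        # tilde and wave-dash characters are removed
    $text =~ s/ー+/ー/g;              # runs of long vowel marks collapse to one
    return $text;
}

# Hypothetical stand-in for normalize_symbols, for illustration only.
sub sketch_normalize_symbols {
    my ($text) = @_;
    $text =~ tr/｡､･｢｣/。、・「」/;    # half-width punctuation becomes full-width
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print sketch_normalize_hyphen("スーーーパー〰〰"), "\n";    # prints スーパー
print sketch_normalize_symbols("｢鮭｣､｢味噌汁｣｡"), "\n";     # prints 「鮭」、「味噌汁」。
```

The order of the hyphen substitutions matters in this sketch: the long vowel marks are collapsed last, so marks produced by the second substitution are merged as well.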
LICENSE
Copyright (C) Kei Kamikawa (Code-Hex).
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
AUTHOR
Kei Kamikawa x00.x7f@gmail.com