NAME
    Lingua::JA::TermExtractor - Term Extractor

SYNOPSIS
      use Lingua::JA::TermExtractor;
      use utf8;
      use feature qw/say/;
      use Data::Printer;

      my $extractor = Lingua::JA::TermExtractor->new(
          df_file           => './df.tch', # Please download from http://misc.pawafuru.com/webidf/.
          pos1_filter       => [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能 サ変接続/],
          ng_word           => [qw/編集 本人 自身 自分 たち さん/],
      );

      p $extractor->extract($document)->dump;
      p $extractor->extract(\@documents)->dump;

      for my $result (@{ $extractor->extract(\@documents)->list(50) })
      {
          my ($word, $score) = each %{$result};

          say "$word: $score";
      }

DESCRIPTION
    Lingua::JA::TermExtractor extracts terms from one or more documents
    and ranks them by their TF*WebIDF or BM25 scores.

METHODS
  new( %config || \%config )
    Creates a new Lingua::JA::TermExtractor instance.

    The following default values are used for any key you do not set in
    %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      k1                  2.0
      b                   0.75

      pos1_filter         [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能/]
      pos2_filter         []
      pos3_filter         []
      ng_word             []
      term_length_min     2
      term_length_max     30
      concat_max          30
      tf_min              1
      df_min              0
      df_max              250_0000_0000
      fetch_unk_word_df   0
      db_auto             1
      guess_df            1

      idf_type            3
      api                 'YahooPremium'
      appid               undef
      driver              'TokyoCabinet'
      df_file             './df.tch'
      fetch_df            0
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef
      verbose             0
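
    The "%config || \%config" calling convention can be illustrated with
    a minimal sketch. It is not the module's source code: the
    merge_config helper and the shortened %DEFAULT hash are hypothetical,
    and only the default values are copied from the table above.

```perl
use strict;
use warnings;

# Hypothetical illustration only; a subset of the defaults listed above.
my %DEFAULT = (
    k1              => 2.0,
    b               => 0.75,
    term_length_min => 2,
    term_length_max => 30,
);

sub merge_config {
    # Accept either a flat key/value list or a single hash reference.
    my %given = (@_ == 1 && ref $_[0] eq 'HASH') ? %{ $_[0] } : @_;

    # Later pairs win, so the caller's values override the defaults.
    return { %DEFAULT, %given };
}

my $from_list = merge_config(k1 => 1.2);
my $from_ref  = merge_config({ k1 => 1.2 });

print "$from_list->{k1} $from_ref->{k1} $from_list->{b}\n";
```

    Both call styles give the same result here: k1 is overridden and b
    keeps its default.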

    k1 => $weight
        BM25 parameter that weights term frequency (TF).

    b => $weight
        BM25 parameter that weights document length normalization.

    pos(1|2|3)_filter, ng_word, term_length_(min|max), concat_max, tf_min,
    df_(min|max), fetch_unk_word_df, db_auto, guess_df
        See Lingua::JA::TFWebIDF.

    idf_type, api, appid, driver, df_file, fetch_df, expires_in, documents,
    Furl_HTTP, verbose
        See Lingua::JA::WebIDF.
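
    To show how k1 and b interact, here is a minimal sketch of the
    classic Okapi BM25 term weight. This document does not state the
    exact formula the module uses, so treat this as the textbook form
    only. The bm25_term_score helper and its argument names are
    hypothetical; the k1 and b defaults are taken from the table above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::JA::TermExtractor.
# Textbook Okapi BM25 weight of one term in one document.
sub bm25_term_score {
    my (%arg) = @_;
    my $k1      = $arg{k1} // 2.0;     # default from the table above
    my $b_param = $arg{b}  // 0.75;    # default from the table above

    # Length normalization: b = 0 ignores document length,
    # b = 1 applies it fully.
    my $norm = 1 - $b_param + $b_param * $arg{doc_len} / $arg{avg_doc_len};

    # TF saturation: a larger k1 lets repeated occurrences keep
    # adding weight for longer before the score levels off.
    my $tf_part = $arg{tf} * ($k1 + 1) / ($arg{tf} + $k1 * $norm);

    return $arg{idf} * $tf_part;
}

my $score = bm25_term_score(
    tf          => 3,
    doc_len     => 100,
    avg_doc_len => 100,
    idf         => 2.0,
);
printf "%.4f\n", $score;    # prints 3.6000
```

    With a document of average length the normalization factor is 1, so
    the score reduces to idf * tf * (k1 + 1) / (tf + k1).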

  extract( $document || \@documents )
    Extracts terms from $document or \@documents and sorts them based on
    their TF*WebIDF or BM25 scores.

    If a single $document is given, TF*WebIDF is used. If \@documents is
    given, BM25 is used.

    Word segmentation and POS tagging are done via MeCab.
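
    The single-document ranking can be sketched as plain TF*IDF. This is
    an illustration under assumptions, not the module's implementation:
    the tfidf_scores helper, the sample terms and the sample counts are
    hypothetical, and idf = log(N / df) is only one common IDF variant.
    The variant actually used depends on idf_type (see
    Lingua::JA::WebIDF).

```perl
use strict;
use warnings;

# Hypothetical helper: plain TF*IDF with idf = log(N / df).
# $tf and $df are hashrefs (term => count); $n_docs is the corpus size.
sub tfidf_scores {
    my ($tf, $df, $n_docs) = @_;
    my %score;
    for my $term (keys %$tf) {
        # Skip terms whose document frequency is unknown or zero.
        my $doc_freq = $df->{$term} or next;
        $score{$term} = $tf->{$term} * log($n_docs / $doc_freq);
    }
    return \%score;
}

# Made-up sample data for illustration only.
my $scores = tfidf_scores(
    { term_a => 3,  term_b => 1,    term_c => 5 },
    { term_a => 10, term_b => 1000 },               # term_c has no DF
    1000,
);

# Rank by descending score, breaking ties alphabetically.
for my $term (sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores) {
    printf "%s: %.3f\n", $term, $scores->{$term};
}
```

    In this sample, term_a ranks first because it is both frequent in the
    document and rare in the corpus, while term_b, which occurs in every
    corpus document, scores zero.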

  tfidf, tf
    See Lingua::JA::TFWebIDF.

  idf, df, purge, db_open, db_close
    See Lingua::JA::WebIDF.

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    Lingua::JA::TFWebIDF

    Lingua::JA::WebIDF

    Lingua::JA::WebIDF::Driver::TokyoTyrant

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.