hightman/scws

Hi hightman, I get an error when adding a custom word list from PHP

Opened this issue · 6 comments

$so = scws_new();
$so->set_charset('utf8');
// set_dict and set_rule are not called here, so the system automatically tries
// the dictionary and rule files under the path specified in the ini settings
//$dictPath = ini_get('scws.default.fpath').'/dict.utf8.xdb';
//$so->set_dict($dictPath); // set the dictionary

//$so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
$so->add_dict('/usr/local/scws/etc/dict.user.txt');
//$so->set_rule('/usr/local/scws/etc/rules.utf8.ini');

$so->set_duality(true); // whether to group stray characters automatically into two-character words
$so->set_ignore(true);  // whether to drop punctuation and similar symbols from the results
$so->set_multi(1);      // bitwise OR of 1 | 2 | 4 | 8: short words | duality | main single characters | all single characters

$so->send_text("我是一个**人,我会C++语言,我也有很多T恤衣服,我的衣服比我还重老司机遇上新能源遇上新能源这个分词怎么分");
echo '<pre>';
//$tmp = $so->get_result();
//$tmp = $so->get_tops(6, '~V');
$tmp = $so->get_tops(7);
foreach ($tmp as $v)
{
    print_r($v);
}
$so->close();

The error is always reported on line 699 of my script, the call $so->add_dict('/usr/local/scws/etc/dict.user.txt');. I want to add some custom words, for example 老司机.

What am I doing wrong?

Thanks

Figured it out: passing the SCWS_XDICT_TXT parameter fixes it.
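For anyone who hits the same error, here is a minimal sketch of the fix described above. It assumes the scws PHP extension is loaded and that add_dict takes the dictionary mode as its second argument, which is how I understand the extension's API. The helper name add_text_dict is made up for illustration; the path is the one from the question.

```php
<?php
// Hypothetical helper: add a plain-text user dictionary to an existing
// SCWS handle. Without the second argument, add_dict appears to treat
// the file as an XDB dictionary, so a .txt word list needs SCWS_XDICT_TXT.
function add_text_dict($so, string $path): bool
{
    return (bool) $so->add_dict($path, SCWS_XDICT_TXT);
}

// Only runs where the scws extension is actually available.
if (extension_loaded('scws')) {
    $so = scws_new();
    $so->set_charset('utf8');
    if (!add_text_dict($so, '/usr/local/scws/etc/dict.user.txt')) {
        echo "could not load the user dictionary\n";
    }
    $so->close();
}
```

The default dictionary is still picked up from the ini path because set_dict is never called; add_dict only layers the user word list on top of it.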

One more question: how can I remove modal particles and other words that will never be useful? For example:

Array
(
[word] => 收入
[times] => 4
[weight] => 19.559999465942
[attr] => n
)
Array
(
[word] => 可以
[times] => 4
[weight] => 18.680000305176
[attr] => v
)
Array
(
[word] => 返利
[times] => 2
[weight] => 16.979999542236
[attr] => v
)
Array
(
[word] => 不仅
[times] => 3
[weight] => 14.849999427795
[attr] => c
)
Array
(
[word] => 也许
[times] => 3
[weight] => 14.819999694824
[attr] => d
)
Array
(
[word] => 他们
[times] => 3
[weight] => 14.760000228882
[attr] => r
)
Array
(
[word] => 拥有
[times] => 3
[weight] => 14.700000762939
[attr] => v
)
Array
(
[word] => 优惠
[times] => 3
[weight] => 14.549999237061
[attr] => vn
)
Array
(
[word] => 如果
[times] => 3
[weight] => 14.460000991821
[attr] => c
)
Array
(
[word] => 财富
[times] => 3
[weight] => 14.400000572205
[attr] => n
)
Array
(
[word] => 消费
[times] => 3
[weight] => 14.130000114441
[attr] => vn
)
Array
(
[word] => 自己
[times] => 3
[weight] => 13.650000572205
[attr] => r
)

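How can I exclude words such as 如果 (if), 自己 (oneself), 不仅 (not only), 也许 (perhaps), and 他们 (they) from the segmentation results for this article?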

I tried filtering by part of speech with $tmp = $so->get_tops(100, '~v,~d,~y,~e,~r,~a'); but it had no effect:
Array
(
[word] => 不仅
[times] => 3
[weight] => 14.849999427795
[attr] => c
)
Array
(
[word] => 也许
[times] => 3
[weight] => 14.819999694824
[attr] => d
)
Array
(
[word] => 他们
[times] => 3
[weight] => 14.760000228882
[attr] => r
)
Array
(
[word] => 如果
[times] => 3
[weight] => 14.460000991821
[attr] => c
)
Array
(
[word] => 财富
[times] => 3
[weight] => 14.400000572205
[attr] => n
)
Array
(
[word] => 消费
[times] => 3
[weight] => 14.130000114441
[attr] => vn
)
Array
(
[word] => 自己
[times] => 3
[weight] => 13.650000572205
[attr] => r
)
Could someone point out which setting is wrong?
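One workaround that does not depend on how get_tops parses its exclusion string is to filter the returned array in plain PHP. The sketch below is only an illustration: filter_tops is a hypothetical helper, not part of SCWS, and the sample rows are copied from the output above with the weights rounded. Note that the filter string in the question does not list c, the tag shown for 不仅 and 如果, so those two would remain even if the other exclusions had worked.

```php
<?php
/**
 * Hypothetical helper: drop entries whose part-of-speech tag or word
 * is on an exclusion list. $tops is assumed to have the same shape as
 * the get_tops() output shown above (word, times, weight, attr).
 */
function filter_tops(array $tops, array $dropAttrs, array $stopWords = []): array
{
    $kept = array_filter($tops, function (array $item) use ($dropAttrs, $stopWords): bool {
        // Tags are compared exactly, so dropping 'v' does not drop 'vn'.
        return !in_array($item['attr'], $dropAttrs, true)
            && !in_array($item['word'], $stopWords, true);
    });
    return array_values($kept); // reindex from 0
}

// Sample rows taken from the output above (weights rounded).
$tops = [
    ['word' => '收入', 'times' => 4, 'weight' => 19.56, 'attr' => 'n'],
    ['word' => '不仅', 'times' => 3, 'weight' => 14.85, 'attr' => 'c'],
    ['word' => '也许', 'times' => 3, 'weight' => 14.82, 'attr' => 'd'],
    ['word' => '他们', 'times' => 3, 'weight' => 14.76, 'attr' => 'r'],
    ['word' => '财富', 'times' => 3, 'weight' => 14.40, 'attr' => 'n'],
    ['word' => '消费', 'times' => 3, 'weight' => 14.13, 'attr' => 'vn'],
];

// Exclude conjunctions, adverbs, pronouns, particles and interjections.
$filtered = filter_tops($tops, ['c', 'd', 'r', 'y', 'e']);
print_r(array_column($filtered, 'word'));
```

Because the filtering happens after get_tops has applied its limit, it may be worth requesting more entries than needed and trimming the filtered list with array_slice.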