Segmenting custom dictionary entries that mix Chinese, English letters, and digits

Question

Closed this issue 6 years ago · 6 comments

I defined a custom dictionary with the following entries:

WORD TF IDF ATTR

京基ab 1.00 1.00 @^@
京基1 1.00 1.00 @^@
京基a 1.00 1.00 @^@
京基1ab 1.00 1.00 @^@
京基1a 1.00 1.00 @^@
京基100 1.00 1.00 @^@

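Test code:

<?php
// Assumption: the opening of this snippet was lost from the post, so scws_new() is reconstructed.
$so = scws_new();
$so->set_charset('utf8'); // character encoding
$so->set_dict('/home/ira/www/farm.ira.orantrip.com/tmp/article/all.xdb');
$so->set_ignore(false);
$so->set_ignore(true); // ignore punctuation (overrides the previous call)
$so->send_text($text); // $text holds the string to segment
print_r($so->get_words('@'));
?>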

Output:
Array
(
    [0] => Array
        (
            [word] => 京基1
            [times] => 1
            [weight] => 1
            [attr] => @
        )

    [1] => Array
        (
            [word] => 京基a
            [times] => 1
            [weight] => 1
            [attr] => @
        )

    [2] => Array
        (
            [word] => 京基1a
            [times] => 1
            [weight] => 1
            [attr] => @
        )

    [3] => Array
        (
            [word] => 京基ab
            [times] => 1
            [weight] => 1
            [attr] => @
        )

)

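京基100, which I need returned as a single word, is not in the result, and none of the entries whose letters and digits total more than 2 are returned either. Is there a setting that handles this? Thanks.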

ljx0517 commented 8 years ago

+1

Answer 1 · 2016-02-29T14:29:24.000Z

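As the documentation says, mixed Chinese-English entries support at most 2 characters (letters or digits). Beyond 2, the parts are simply segmented separately; there is no need to group them together.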

Best Regards

hightman/海鳗

WeChat/Weibo: hightman
Github：https://github.com/hightman
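
To check which entries of a custom dictionary fall outside that limit, a small script can count the letters and digits in each entry. The sketch below is Python and is not part of SCWS. It assumes the limit counts ASCII letters and digits in an entry that also contains Chinese characters; the names `MAX_ALNUM`, `alnum_count`, and `over_limit` are hypothetical.

```python
import re

# Assumption: the limit applies to ASCII letters and digits in a mixed entry.
MAX_ALNUM = 2


def alnum_count(entry):
    """Count the ASCII letters and digits in a dictionary entry."""
    return len(re.findall(r"[A-Za-z0-9]", entry))


def over_limit(entries):
    """Return entries that contain Chinese text and more than MAX_ALNUM letters/digits."""
    flagged = []
    for entry in entries:
        has_chinese = re.search(r"[\u4e00-\u9fff]", entry) is not None
        if has_chinese and alnum_count(entry) > MAX_ALNUM:
            flagged.append(entry)
    return flagged


# The six entries from the question.
entries = ["京基ab", "京基1", "京基a", "京基1ab", "京基1a", "京基100"]
print(over_limit(entries))
```

Applied to the six entries from the question, this flags 京基1ab and 京基100, the same two entries that are missing from the reported output.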

Answer 2 · 2016-03-01T07:31:51.000Z

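The goal is to analyze place names and building names. If they are split apart, I cannot tell whether the target name appears in the content. For names such as 「昂坪360」, 「天际100」, and 「京基100」, the search comparison cannot find a match. Is there a setting that extends the number of supported characters? Thanks.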
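
For names like these, one possible workaround outside the segmenter is to re-join adjacent tokens whose concatenation is a known term. The following is a minimal sketch in Python, not an SCWS feature. It assumes the segmenter returns the Chinese part and the digits as separate adjacent tokens; `merge_known_terms`, `terms`, and `max_span` are hypothetical names, and the sample token list is made up for illustration.

```python
def merge_known_terms(tokens, terms, max_span=3):
    """Greedily merge runs of adjacent tokens whose concatenation is a known term."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so that longer terms win over shorter ones.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in terms:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token term starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical segmenter output and term list.
terms = {"昂坪360", "天际100", "京基100"}
tokens = ["参观", "京基", "100", "和", "天际", "100"]
print(merge_known_terms(tokens, terms))
```

The merged list can then be compared directly against the list of place names, at the cost of maintaining that list separately from the dictionary.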

Answer 3 · 2016-03-01T07:34:11.000Z

Not at present.

Answer 4 · 2016-10-25T03:38:37.000Z

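Shouldn't the custom dictionary take priority? Plenty of words mix Chinese with letters or digits, for example: 好123, 4399游戏, 300英雄, 163邮箱, 2016传奇, 荣威550, 本田XR-V, 大众Polo, 神仙道2016, 小米note, Wifi万能钥匙, 量贩ktv.
If these words are in the dictionary, I think they should be recognized.
Another problem is that spaces are not supported, for example iphone 6s, 小米5s Plus, and so on. I hope support for this can be added.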

Answer 5 · 2018-10-17T07:07:31.000Z

I don't think this matters much. Splitting 4399游戏 into 4399 + 游戏 does not affect search either.
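
The point about search can be illustrated with token-level phrase matching: if documents and queries are segmented the same way, a query for 4399游戏 still matches wherever the two tokens appear side by side. The following is a minimal Python sketch under that assumption; `contains_phrase` is a hypothetical helper, not an SCWS API, and the sample tokens are made up.

```python
def contains_phrase(doc_tokens, query_tokens):
    """Return True if query_tokens occur as a consecutive run inside doc_tokens."""
    n, m = len(doc_tokens), len(query_tokens)
    if m == 0:
        return True
    return any(doc_tokens[i:i + m] == query_tokens for i in range(n - m + 1))


# Hypothetical segmentation of a document and of two queries.
doc_tokens = ["我", "在", "4399", "游戏", "玩", "小游戏"]
print(contains_phrase(doc_tokens, ["4399", "游戏"]))
print(contains_phrase(doc_tokens, ["游戏", "4399"]))
```

This only holds when the tokens are adjacent and in order, so a check like this depends on the index and the query going through the same segmentation.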