infinilabs/analysis-pinyin

First-letter search: mec cannot find 木耳草

Opened this issue · 3 comments

Index configuration

"analyzer": {
        "pinyin_analyzer": {
             "tokenizer": "my_pinyin"
        }
      }

 "tokenizer": {
       "my_pinyin": {
          "lowercase": "true",
          "keep_original": "false",
          "keep_first_letter": "true",
          "keep_separate_first_letter": "true",
          "type": "pinyin",
          "limit_first_letter_length": "64",
          "keep_full_pinyin": "true"
        }
 }


 "properties": {
      "name": {
          "type": "keyword",
            "py": {
              "type": "text",
              "analyzer": "pinyin_analyzer",
              "search_analyzer": "pinyin_analyzer"
            } 
        }
  }

Index time (木耳草)

{
"tokens": [
    {
        "token": "m",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 0
    },
    {
        "token": "mu",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 0
    },
    {
        "token": "e",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    },
    {
        "token": "er",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    },
    {
        "token": "c",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 2
    },
    {
        "token": "cao",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 2
    },
    {
        "token": "mec",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 2
    }
]}

Search time (mec)

{
"tokens": [
    {
        "token": "me",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 0
    },
    {
        "token": "c",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    },
    {
        "token": "mec",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    }
] }

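At search time, the tokens produced for mec include me, which is not among the index-time tokens, so a phrase query returns nothing. Is there a solution?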
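To make the mismatch concrete, here is a minimal Python sketch of position-based phrase matching, using the two token lists above. It is a simplified model, not Elasticsearch's actual implementation: it assumes that terms sharing a position are treated as alternatives and that each query position must be satisfied at the same relative offset in the document. The names `group_by_position` and `phrase_matches` are made up for this illustration.

```python
from collections import defaultdict

# Tokens copied from the two _analyze outputs above: (term, position).
INDEX_TOKENS = [("m", 0), ("mu", 0), ("e", 1), ("er", 1),
                ("c", 2), ("cao", 2), ("mec", 2)]
QUERY_TOKENS = [("me", 0), ("c", 1), ("mec", 1)]


def group_by_position(tokens):
    """Map each position to the set of terms emitted at it."""
    grouped = defaultdict(set)
    for term, pos in tokens:
        grouped[pos].add(term)
    return dict(grouped)


def phrase_matches(index_tokens, query_tokens):
    """Simplified phrase check (an assumption, see the lead-in): every query
    position needs at least one of its terms at the matching document offset."""
    doc = group_by_position(index_tokens)
    query = group_by_position(query_tokens)
    first = min(query)
    for start in doc:
        if all(doc.get(start + (qpos - first), set()) & terms
               for qpos, terms in query.items()):
            return True
    return False


# Position 0 of the query only offers "me", which was never indexed.
print(phrase_matches(INDEX_TOKENS, QUERY_TOKENS))   # False
# A query that yields only the joined first letters would match.
print(phrase_matches(INDEX_TOKENS, [("mec", 0)]))   # True
```

Under this model the phrase fails only because position 0 of the query has no indexed counterpart; the shared term mec at the next position is not enough on its own.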

medcl commented

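When pinyin produces multiple duplicate terms with overlapping positions, it is inherently unsuitable for phrase queries. Switching to an ordinary query should work: both the query and the index produce the term mec, so the document should be found.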

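@medcl
Thanks for the reply.
After replacing phrase with best_fields, the match scope is rather broad and some unrelated results show up.
If I set the search analyzer to keyword_analyzer, the document is found, which solves the current scenario, but it introduces other problems: for example, searching muer no longer works. This is a bit tricky.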
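The trade-off described here can be sketched in Python with plain set operations on the terms from the analyzer output earlier in the thread. This is only an illustration: it assumes a non-phrase match succeeds when any query term is present in the document, and that keyword_analyzer emits the whole input as a single term. Both helper names are hypothetical.

```python
# Terms indexed for 木耳草, taken from the index-time output in this thread.
INDEX_TERMS = {"m", "mu", "e", "er", "c", "cao", "mec"}
# Terms produced for the query "mec" by pinyin_analyzer (search-time output).
PINYIN_QUERY_TERMS = {"me", "c", "mec"}


def shared_terms(index_terms, query_terms):
    """Terms that an any-term (OR) match could hit."""
    return index_terms & query_terms


def keyword_terms(text):
    """Assumed behaviour of keyword_analyzer: one term, the input itself."""
    return {text}


# pinyin_analyzer at search time: the single letter "c" also matches, and it
# would match any other document containing a syllable that starts with c.
print(sorted(shared_terms(INDEX_TERMS, PINYIN_QUERY_TERMS)))     # ['c', 'mec']
# keyword_analyzer at search time: only the joined first letters match ...
print(sorted(shared_terms(INDEX_TERMS, keyword_terms("mec"))))   # ['mec']
# ... but "muer" is never indexed as a single term, so nothing matches.
print(sorted(shared_terms(INDEX_TERMS, keyword_terms("muer"))))  # []
```

The stray single-letter match is one plausible reason for the unrelated hits, although the thread does not confirm the cause.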

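I am using medcl3 from the example:
POST /medcl3/_doc/lucy {"name":"敏感的心"}
When I search for mingan, the query is split into ming/an, so min/gan is never matched, even though the indexed tokens do include min and gan. Searching for mg works.
How can this be solved?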
GET /medcl3/_validate/query?explain { "query": {"match": { "name.pinyin": "mingan" }} }
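The underlying ambiguity can be shown with a short Python sketch that enumerates every way to split a Latin string into pinyin syllables. The syllable set is a tiny hand-picked sample for this example, not the plugin's real table, and `segmentations` is a hypothetical helper, not part of analysis-pinyin.

```python
# Tiny illustrative syllable set (an assumption); a real table is far larger.
SYLLABLES = {"min", "ming", "gan", "an", "de", "xin"}

# Full-pinyin terms one would expect for 敏感的心, per the comment above.
INDEXED_TERMS = {"min", "gan", "de", "xin"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split `text` into known syllables."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            for tail in segmentations(text[end:], syllables):
                results.append([head] + tail)
    return results


for split in segmentations("mingan"):
    # A split is only useful if all of its syllables were indexed.
    print(split, all(s in INDEXED_TERMS for s in split))
# ['min', 'gan'] True
# ['ming', 'an'] False
```

Both splits are valid pinyin, so a tokenizer that commits to ming/an, as reported above, produces terms that were never indexed for this document.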