Mmseg Analysis for ElasticSearch
==========
The Mmseg Analysis plugin integrates Lucene mmseg4j-analyzer:http://code.google.com/p/mmseg4j/ into elasticsearch, support customized dictionary.
The plugin includes a `mmseg` analyzer and a `mmseg` tokenizer.
Version
-—————
master | 1.0.0 → master
1.2.2 | 1.0.0
1.2.1 | 0.90.2
1.2.0 | 0.90.0
1.1.2 | 0.20.1
1.1.1 | 0.19.x
Install
-—————
you can download this plugin from RTF project(https://github.com/medcl/elasticsearch-rtf)
https://github.com/medcl/elasticsearch-rtf/tree/master/plugins/analysis-mmseg
download the dict files,unzip these dict file to your elasticsearch’s config folder,such as: your-es-root/config/mmseg
https://github.com/medcl/elasticsearch-rtf/tree/master/config/mmseg
you need a service restart after that!
Analysis Configuration (elasticsearch.yml)
-—————
index: analysis: analyzer: mmseg: alias: [news_analyzer, mmseg_analyzer] type: org.elasticsearch.index.analysis.MMsegAnalyzerProvider index.analysis.analyzer.default.type : "mmseg"
additional parameters that can be used to customize the mmseg tokenizer
index: analysis: tokenizer: mmseg_maxword: type: mmseg seg_type: "max_word" mmseg_complex: type: mmseg seg_type: "complex" mmseg_simple: type: mmseg seg_type: "simple"
Mapping Configuration
-—————
Here is a quick example:
1.create a index
curl -XPUT http://localhost:9200/index
2.create a mapping
curl -XPOST http://localhost:9200/index/fulltext/_mapping -d' { "fulltext": { "_all": { "indexAnalyzer": "mmseg", "searchAnalyzer": "mmseg", "term_vector": "no", "store": "false" }, "properties": { "content": { "type": "string", "store": "no", "term_vector": "with_positions_offsets", "indexAnalyzer": "mmseg", "searchAnalyzer": "mmseg", "include_in_all": "true", "boost": 8 } } } }'
3.indexing some docs
curl -XPOST http://localhost:9200/index/fulltext/1 -d' {content:"美国留给伊拉克的是个烂摊子吗"} ' curl -XPOST http://localhost:9200/index/fulltext/2 -d' {content:"公安部:各地校车将享最高路权"} ' curl -XPOST http://localhost:9200/index/fulltext/3 -d' {content:"中韩渔警冲突调查:韩警平均每天扣1艘**渔船"} ' curl -XPOST http://localhost:9200/index/fulltext/4 -d' {content:"**驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"} '
4.query with highlighting
curl -XPOST http://localhost:9200/index/fulltext/_search -d' { "query" : { "term" : { "content" : "**" }}, "highlight" : { "pre_tags" : ["<tag1>", "<tag2>"], "post_tags" : ["</tag1>", "</tag2>"], "fields" : { "content" : {} } } } '
here is the query result
{ "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 2, "hits": [ { "_index": "index", "_type": "fulltext", "_id": "4", "_score": 2, "_source": { "content": "**驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首" }, "highlight": { "content": [ "<tag1>**</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 " ] } }, { "_index": "index", "_type": "fulltext", "_id": "3", "_score": 2, "_source": { "content": "中韩渔警冲突调查:韩警平均每天扣1艘**渔船" }, "highlight": { "content": [ "均每天扣1艘<tag1>**</tag1>渔船 " ] } } ] } }
have fun.