ElasticSearch基础

Question

ElasticSearch基础

rico-c opened this issue 5 years ago · 0 comments

ElasticSearch

基础

在Elasticsearch中存储数据的行为就叫做索引(indexing)
结构对比传统数据库

MySQL         -> Databases -> Tables    -> Rows      -> Columns
Elasticsearch -> Indices   -> Types     -> Documents -> Fields

存入数据

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

名字	说明
megacorp	索引名(indces)
employee	类型名(types)
1	这个员工的ID(Documents)

获取数据

GET /megacorp/employee/1

{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "John",
      "last_name" :   "Smith",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}

搜索数据

GET /megacorp/employee/_search使用关键字_search来取代原来的文档ID, 默认情况下搜索会返回前10个结果

{
   "took":      6,
   "timed_out": false,
   "_shards": { ... },
   "hits": {
      "total":      3,
      "max_score":  1,
      "hits": [
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "3",
            "_score":         1,
            "_source": {
               "first_name":  "Douglas",
               "last_name":   "Fir",
               "age":         35,
               "about":       "I like to build cabinets",
               "interests": [ "forestry" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "1",
            "_score":         1,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "2",
            "_score":         1,
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

**查询字符串(query string)**搜索

GET /megacorp/employee/_search?q=last_name:Smith到所有姓氏为Smith的结果

**DSL(Domain Specific Language特定领域语言)**搜索

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

GET /megacorp/employee/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 } <1>
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" <2>
                }
            }
        }
    }
}

想要确切的匹配若干个单词或者短语(phrases)

例如我们想要查询同时包含"rock"和"climbing"（并且是相邻的）的员工记录。

要做到这个，我们只要将match查询变更为match_phrase查询即可:
高亮

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            },
            "highlight": {
               "about": [
                  "I love to go <em>rock</em> <em>climbing</em>" <1>
               ]
            }
         }
      ]
   }
}

聚合

想知道所有姓"Smith"的人最大的共同点（兴趣爱好）

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}

...
  "all_interests": {
     "buckets": [
        {
           "key": "music",
           "doc_count": 2
        },
        {
           "key": "sports",
           "doc_count": 1
        }
     ]
  }

分级聚合

GET /megacorp/employee/_search
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

分片和集群

分片定义

一个分片(shard)是一个最小级别“工作单元(worker unit)”,它只是保存了索引中所有数据的一部分。在接下来的《深入分片》一章，我们将详细说明分片的工作原理，但是现在我们只要知道分片就是一个Lucene实例，并且它本身就是一个完整的搜索引擎。我们的文档存

工作

三个**复制分片(replica shards)**也已经被分配了——分别对应三个主分片，这意味着在丢失任意一个节点的情况下依旧可以保证数据的完整性。

文档的索引将首先被存储在主分片中，然后并发复制到对应的复制节点上。这可以确保我们的数据在主节点和复制节点上都可以被检索。

故障转移

只要第二个节点与第一个节点有相同的cluster.name（请看./config/elasticsearch.yml文件），它就能自动发现并加入第一个节点所在的集群

数据

一个文档不只有数据。它还包含了元数据(metadata)——关于文档的信息。三个必须的元数据节点是：

`_index`	文档存储的地方
`_type`	文档代表的对象的类
`_id`	文档的唯一标识

索引名称名字必须是全部小写，不能以下划线开头，不能包含逗号
检索文档的一部分

GET /website/blog/123?_source=title,text

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "exists" :   true,
  "_source" : {
      "title": "My first blog entry" ,
      "text":  "Just trying this out..."
  }
}

如果你想做的只是检查文档是否存在——你对内容完全不感兴趣——使用HEAD方法来代替GET。HEAD请求不会返回响应体，只有HTTP头：

更新文档

文档在Elasticsearch中是不可变的，我们可以使用《索引文档》章节提到的index API 重建索引(reindex) 或者替换掉它。

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

在响应中，我们可以看到Elasticsearch把_version增加了。

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 2,
  "created":   false <1>
}

局部更新

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

使用脚本更新

POST /website/blog/1/_update
{
   "script" : "ctx._source.tags+=new_tag",
   "params" : {
      "new_tag" : "search"
   }
}

保证PUT创建文档而不覆盖

第一种方法使用op_type查询参数：

PUT /website/blog/123?op_type=create
{ ... }

或者第二种方法是在URL后加/_create做为端点：

PUT /website/blog/123/_create
{ ... }

如果请求成功的创建了一个新文档，Elasticsearch将返回正常的元数据且响应状态码是201 Created。

另一方面，如果包含相同的_index、_type和_id的文档已经存在，Elasticsearch将返回409 Conflict响应状态码，错误信息类似如下：

{
  "error" : "DocumentAlreadyExistsException[[website][4] [blog][123]:
             document already exists]",
  "status" : 409
}

删除文档

DELETE /website/blog/123

避免版本冲突

假设冲突不经常发生，也不区块化访问，然而，如果在读写过程中数据发生了变化，更新操作将失败。这时候由程序决定在失败后如何解决冲突。

上文我们提到index、get、delete请求时，我们指出每个文档都有一个_version号码，这个号码在文档被改变时加一。Elasticsearch使用这个_version保证所有修改都被正确排序。当一个旧版本出现在新版本之后，它会被简单的忽略。

指定一个新version号码为10

PUT /website/blog/2?version=10&version_type=external
{
  "title": "My first external blog entry",
  "text":  "This is a piece of cake..."
}

搜索

类型

search API有两种表单：一种是“简易版”的查询字符串(query string)将所有参数通过查询字符串定义，另一种版本使用JSON完整的表示请求体(request body)

跨索引、类型搜索

/_search

在所有索引的所有类型中搜索

/gb/_search

在索引gb的所有类型中搜索

/gb,us/_search

在索引gb和us的所有类型中搜索

/g*,u*/_search

在以g或u开头的索引的所有类型中搜索

/gb/user/_search

在索引gb的类型user中搜索

/gb,us/user,tweet/_search

在索引gb和us的类型为user和tweet中搜索

/_all/user,tweet/_search

在所有索引的user和tweet中搜索 search types user and tweet in all indices

分页

Elasticsearch接受from和size参数：

size`: 结果数，默认`10
from`: 跳过开始的结果数，默认`0

GET /_search?size=5&from=10

快速全文搜索

Elasticsearch使用一种叫做**倒排索引(inverted index)**的结构来做快速的全文搜索

数据类型

Elasticsearch支持以下简单字段类型：

类型	表示的数据类型
String	`string`
Whole number	`byte`, `short`, `integer`, `long`
Floating point	`float`, `double`
Boolean	`boolean`
Date	`date`

默认输入的转换类型

JSON type	Field type
Boolean: `true` or `false`	`"boolean"`
Whole number: `123`	`"long"`
Floating point: `123.45`	`"double"`
String, valid date: `"2014-09-15"`	`"date"`
String: `"foo bar"`	`"string"`

查看映射

可以使用_mapping后缀来查看Elasticsearch中的映射

GET /gb/_mapping/tweet

{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type": "date",
                  "format": "strict_date_optional_time||epoch_millis"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}

更新映射

PUT /gb <1>
{
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "string",
          //指定为english分析器
          "analyzer": "english"
        },
        "date" : {
          "type" :   "date"
        },
        "name" : {
          "type" :   "string"
        },
        "user_id" : {
          "type" :   "long"
        }
      }
    }
  }
}

空字段

当然数组可以是空的。这等价于有零个值。事实上，Lucene没法存放null值，所以一个null值的字段被认为是空字段。

这四个字段将被识别为空字段而不被索引：

"empty_string":             "",
"null_value":               null,
"empty_array":              [],
"array_with_null_value":    [ null ]

结构化查询

使用结构化查询，你需要传递query参数

合并多子句

查询的是邮件正文中含有“business opportunity”字样的星标邮件或收件箱中正文中含有“business opportunity”字样的非垃圾邮件

{
    "bool": {
        "must": { "match":      { "email": "business opportunity" }},
        "should": [
             { "match":         { "starred": true }},
             { "bool": {
                   "must":      { "folder": "inbox" }},
                   "must_not":  { "spam": true }}
             }}
        ],
        "minimum_should_match": 1
    }
}

结构化查询 & 结构化过滤

原则上来说，使用查询语句做全文本搜索或其他需要进行相关性评分的时候，剩下的全部用过滤语句

查询过滤语句

term主要用于精确匹配哪些值，比如数字，日期，布尔值或 not_analyzed的字符串(未经分析的文本数据类型)：

    { "term": { "age":    26           }}
    { "term": { "date":   "2014-09-01" }}
    { "term": { "public": true         }}
    { "term": { "tag":    "full_text"  }}

`terms` 过滤

terms 跟 term 有点类似，但 terms 允许指定多个匹配条件。如果某个字段指定了多个值，那么文档需要一起去做匹配：

{
    "terms": {
        "tag": [ "search", "full_text", "nosql" ]
        }
}

`range` 过滤

range过滤允许我们按照指定范围查找一批数据：

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

范围操作符包含：

gt :: 大于

gte:: 大于等于

lt :: 小于

lte:: 小于等于

`exists` 和 `missing` 过滤

exists 和 missing 过滤可以用于查找文档中是否包含指定字段或没有某个字段，类似于SQL语句中的IS_NULL条件

{
    "exists":   {
        "field":    "title"
    }
}

这两个过滤只是针对已经查出一批数据来，但是想区分出某个字段是否存在的时候使用。

`bool` 过滤

bool 过滤可以用来合并多个过滤条件查询结果的布尔逻辑，它包含一下操作符：

must :: 多个查询条件的完全匹配,相当于 and。

must_not :: 多个查询条件的相反匹配，相当于 not。

should :: 至少有一个查询条件匹配, 相当于 or。

这些参数可以分别继承一个过滤条件或者一个过滤条件的数组：

{
    "bool": {
        "must":     { "term": { "folder": "inbox" }},
        "must_not": { "term": { "tag":    "spam"  }},
        "should": [
                    { "term": { "starred": true   }},
                    { "term": { "unread":  true   }}
        ]
    }
}

`match_all` 查询

使用match_all 可以查询到所有文档，是没有查询条件下的默认语句。

{
    "match_all": {}
}

此查询常用于合并过滤条件。比如说你需要检索所有的邮箱,所有的文档相关性都是相同的，所以得到的_score为1

`match` 查询

match查询是一个标准查询，不管你需要全文本查询还是精确查询基本上都要用到它。

如果你使用 match 查询一个全文本字段，它会在真正查询之前用分析器先分析match一下查询字符：

{
    "match": {
        "tweet": "About Search"
    }
}

如果用match下指定了一个确切值，在遇到数字，日期，布尔值或者not_analyzed 的字符串时，它将为你搜索你给定的值：

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

提示：做精确匹配搜索时，你最好用过滤语句，因为过滤语句可以缓存数据。

不像我们在《简单搜索》中介绍的字符查询，match查询不可以用类似"+usid:2 +tweet:search"这样的语句。它只能就指定某个确切字段某个确切的值进行搜索，而你要做的就是为它指定正确的字段名以避免语法错误。

`multi_match` 查询

multi_match查询允许你做match查询的基础上同时搜索多个字段：

{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

`bool` 查询

bool 查询与 bool 过滤相似，用于合并多个查询子句。不同的是，bool 过滤可以直接给出是否匹配成功，而bool查询要计算每一个查询子句的 _score （相关性分值）。

must:: 查询指定文档一定要被包含。

must_not:: 查询指定文档一定不要被包含。

should:: 查询指定文档，有则可以为文档相关性加分。

以下查询将会找到 title 字段中包含 "how to make millions"，并且 "tag" 字段没有被标为 spam。如果有标识为 "starred" 或者发布日期为2014年之前，那么这些匹配的文档将比同类网站等级高：

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}

带有过滤的查询语句

GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match": { "email": "business opportunity" }},
            "filter": { "term": { "folder": "inbox" }}
        }
    }
}

验证查询合法性

想知道语句非法的具体错误信息，需要加上 explain 参数：

GET /gb/tweet/_validate/query?explain <1>
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
}

排序

默认情况下，结果集会按照相关性进行排序 -- 相关性越高，排名越靠前

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

配置自定义分析器

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}

结构化搜索

对于准确值，你需要使用过滤器。过滤器的重要性在于它们非常的快。它们不计算相关性（避过所有计分阶段）而且很容易被缓存。

基于SQL的搜索

SELECT document
FROM   products
WHERE  price = 20

ES的搜索

GET /my_store/products/_search
{
    "query" : {
        "filtered" : { <1>
            "query" : {
                "match_all" : {} <2>
            },
            "filter" : {
                "term" : { <3>
                    "price" : 20
                }
            }
        }
    }
}

match_all 用来匹配所有文档，这是默认行为

问题：按照如下方式搜索得不到结果

GET /my_store/products/_search 
{ 
	"query" : { 
			"filtered" : { 
					"filter" : { 
							"term" : { 
									"productID" : "XHDK-A-1293-#fJ3" 
              } 
          } 
      }
  }
}

所有的字符都被转为了小写。 * 我们失去了连字符和 # 符号。所以当我们用 XHDK-A-1293-#fJ3 来查找时，得不到任何结果，因为这个标记不在我们的倒排索引中。相反，那里有上面列出的四个标记。显然，在处理唯一标识码，或其他枚举值时，这不是我们想要的结果。为了避免这种情况发生，我们需要通过设置这个字段为 not_analyzed 来告诉 Elasticsearch 它包含一个准确值。

组合过滤器

SELECT product
FROM   products
WHERE  (price = 20 OR productID = "XHDK-A-1293-#fJ3")
  AND  (price != 30)

GET /my_store/products/_search
{
   "query" : {
      "filtered" : { <1>
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, <2>
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} <2>
              ],
              "must_not" : {
                 "term" : {"price" : 30}, <3>
                // terms支持数组多项过滤
                "terms" : {
                    "price" : [20, 30]
                }
              }
           }
         }
      }
   }
}

日期范围

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}

exists过滤器,可以返回Null值

GET /my_index/posts/_search 
{ 
  "query" : { 
    "filtered" : { 
      "filter" : { 
        "exists" : { 
          "field" : "tags"
        }
      }
    }
  }
}

"hits" : [
  {
    "_id" : "1",
    "_score" : 1.0,
    "_source" : {
      "tags" : ["search"]
    }
  }, 
  { "_id" : "5",
   "_score" : 1.0,
   "_source" : { 
     "tags" : ["search", null]
   }
  }
]