[Elasticsearch 7.x] Join field type

Question

[Elasticsearch 7.x] Join field type

Opened this issue 4 years ago · 2 comments

https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html

Join field type

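The join data type is a special field that creates parent/child relations between documents in the same index.
- The relations section defines a set of possible relations within the documents; each relation consists of a parent name and a child name.
- A parent/child relation can be defined as follows: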

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": { // the name of the field
        "type": "join",
        "relations": {
          "question": "answer"  // defines a single relation where `question` is the parent of `answer`
        }
      }
    }
  }
}

To index a document with a join, the relation name and, optionally, the parent of the document must be provided in the source.
- For example, the following creates two parent documents in the question context:

PUT my-index-000001/_doc/1?refresh
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": {
    "name": "question" // this document is a `question` document
  }
}

PUT my-index-000001/_doc/2?refresh
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

When indexing parent documents, you can specify just the relation name as a shortcut instead of wrapping it in the usual object notation.

PUT my-index-000001/_doc/1?refresh
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": "question"  // shorthand for a parent document: just the relation name
}

PUT my-index-000001/_doc/2?refresh
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": "question"
}

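When indexing a child, the relation name and the parent id of the document must be added to the _source.

The whole lineage of a parent must be indexed on the same shard, so child documents must always be routed using the id of their topmost parent.

The following example indexes two child documents: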

PUT my-index-000001/_doc/3?routing=1&refresh // the routing value is mandatory because parent and child documents must be indexed on the same shard
{
  "my_id": "3",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", // `answer` is the join name for this document
    "parent": "1"  // the parent id of this child document
  }
}

PUT my-index-000001/_doc/4?routing=1&refresh
{
  "my_id": "4",
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}
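
The routing rule above can be sketched as a small helper. This is only a minimal Python sketch using the standard library, not part of any official client: `build_child_request` is a hypothetical name, it just assembles the same request the console examples show, and it assumes the parent itself was indexed with default routing (its own id).

```python
import json
from urllib.parse import quote, urlencode


def build_child_request(index, doc_id, relation, parent_id, source,
                        join_field="my_join_field"):
    """Return (method, path, body) for indexing a child document.

    Hypothetical helper: it only builds the request and never contacts
    a cluster.
    """
    # Route by the parent id so the child lands on the parent's shard.
    query = urlencode({"routing": parent_id})
    path = f"/{quote(index)}/_doc/{quote(str(doc_id))}?{query}"
    body = dict(source)
    body[join_field] = {"name": relation, "parent": str(parent_id)}
    return "PUT", path, json.dumps(body)


method, path, body = build_child_request(
    "my-index-000001", 3, "answer", "1",
    {"my_id": "3", "text": "This is an answer"},
)
print(method, path)
print(body)
```

Keeping the routing value and the `parent` value tied to the same argument makes it harder to send a child to a different shard than its parent by mistake.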

Answer 1 · 2022-01-10T13:56:24.000Z

Parent-join and performance

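The join field should not be used like joins in a relational database. In Elasticsearch, the key to good performance is to de-normalize your data into documents. Each join field, has_child query, or has_parent query adds a significant cost to query performance. It can also trigger the building of global ordinals.

The only case where the join field makes sense is when your data contains a one-to-many relationship in which one entity significantly outnumbers the other. An example is products and the offers for those products: when offers significantly outnumber products, it makes sense to model the product as a parent document and each offer as a child document.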

Parent-join restrictions

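- Only one join field mapping is allowed per index.
- Parent and child documents must be indexed on the same shard. This means the same routing value must be provided when getting, deleting, or updating a child document.
- An element can have multiple children but only one parent.
- It is possible to add a new relation to an existing join field.
- It is also possible to add a child to an existing element, but only if the element is already a parent.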
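
The "only one parent" rule can be checked on the client side before a mapping is sent. The following is a rough Python sketch under that assumption; `parent_of` is a hypothetical helper, and Elasticsearch performs its own validation regardless.

```python
def parent_of(relations):
    """Map each child relation name to its single parent name.

    `relations` has the same shape as the "relations" object of a join
    field mapping. Raises ValueError if a child is listed under two parents.
    """
    parents = {}
    for parent, children in relations.items():
        # A relation value may be one child name or a list of names.
        if isinstance(children, str):
            children = [children]
        for child in children:
            if child in parents:
                raise ValueError(
                    f"{child!r} has two parents: "
                    f"{parents[child]!r} and {parent!r}"
                )
            parents[child] = parent
    return parents


print(parent_of({"question": ["answer", "comment"], "answer": "vote"}))
# {'answer': 'question', 'comment': 'question', 'vote': 'answer'}
```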

Searching with parent-join

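The parent-join creates one field to index the name of the relation within the document (my_parent, my_child, ...).
It also creates one field per parent/child relation. The name of this field is the name of the join field, followed by # and the name of the parent in the relation. For example, for the my_parent → [my_child, another_child] relation, the join field creates an additional field named my_join_field#my_parent.
This field contains the parent _id that the document links to if the document is a child (my_child or another_child), and the _id of the document itself if it is a parent (my_parent).
When searching an index that contains a join field, these two fields are always returned in the search response: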

GET my-index-000001/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["my_id"]
}

Will return:

{
  ...,
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my-index-000001",
        "_type": "_doc",
        "_id": "1",
        "_score": null,
        "_source": {
          "my_id": "1",
          "text": "This is a question",
          "my_join_field": "question" # this document belongs to the `question` join
        },
        "sort": [
          "1"
        ]
      },
      {
        "_index": "my-index-000001",
        "_type": "_doc",
        "_id": "2",
        "_score": null,
        "_source": {
          "my_id": "2",
          "text": "This is another question",
          "my_join_field": "question" # this document belongs to the `question` join
        },
        "sort": [
          "2"
        ]
      },
      {
        "_index": "my-index-000001",
        "_type": "_doc",
        "_id": "3",
        "_score": null,
        "_routing": "1",
        "_source": {
          "my_id": "3",
          "text": "This is an answer",
          "my_join_field": {
            "name": "answer", # this document belongs to the `answer` join
            "parent": "1" # the parent id of this child document
          }
        "sort": [
          "3"
        ]
      },
      {
        "_index": "my-index-000001",
        "_type": "_doc",
        "_id": "4",
        "_score": null,
        "_routing": "1",
        "_source": {
          "my_id": "4",
          "text": "This is another answer",
          "my_join_field": {
            "name": "answer",
            "parent": "1"
          }
        },
        "sort": [
          "4"
        ]
      }
    ]
  }
}
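
Note that in the response above the join field comes back either as a plain string (parents indexed with the shortcut) or as an object (children), so code that reads hits has to handle both shapes. A minimal Python sketch, where `read_join` is a hypothetical helper and the sample sources are copied from the response:

```python
def read_join(source, join_field="my_join_field"):
    """Return (relation_name, parent_id) from a hit's _source.

    parent_id is None when the document does not name a parent.
    """
    value = source[join_field]
    if isinstance(value, str):
        # Shortcut notation: only the relation name was stored.
        return value, None
    # Object notation: "name" is required, "parent" only for children.
    return value["name"], value.get("parent")


question = {"my_id": "1", "text": "This is a question",
            "my_join_field": "question"}
answer = {"my_id": "3", "text": "This is an answer",
          "my_join_field": {"name": "answer", "parent": "1"}}

print(read_join(question))  # ('question', None)
print(read_join(answer))    # ('answer', '1')
```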

Parent-join queries and aggregations

For more information, see the has_child and has_parent queries, the children aggregation, and inner hits.
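
As a rough illustration of the two queries mentioned above, the Python sketch below only builds the search bodies and prints them in console form. The helper names and the inner `match` / `term` queries are made up for this example; `type` and `parent_type` are the parameters the has_child and has_parent queries expect.

```python
import json


def has_child_query(child_type, inner_query):
    """Search body returning parents whose children match inner_query."""
    return {"query": {"has_child": {"type": child_type,
                                    "query": inner_query}}}


def has_parent_query(parent_type, inner_query):
    """Search body returning children whose parent matches inner_query."""
    return {"query": {"has_parent": {"parent_type": parent_type,
                                     "query": inner_query}}}


# Questions that have at least one answer whose text mentions "another".
questions = has_child_query("answer", {"match": {"text": "another"}})
# Answers whose parent question has my_id "1".
answers = has_parent_query("question", {"term": {"my_id": "1"}})

for body in (questions, answers):
    print("GET my-index-000001/_search")
    print(json.dumps(body, indent=2))
```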

The value of the join field is accessible in aggregations and scripts, and can be queried with the parent_id query:

GET my-index-000001/_search
{
  "query": {
    "parent_id": { # querying with the parent_id query
      "type": "answer",
      "id": "1"
    }
  },
  "aggs": {
    "parents": {
      "terms": {
        "field": "my_join_field#question",  # aggregating on the parent id field
        "size": 10
      }
    }
  },
  "runtime_mappings": {
    "parent": {
      "type": "long",
      "script": "emit(Integer.parseInt(doc['my_join_field#question'].value))" # accessing the parent id field in a script
    }
  },
  "fields": [
    { "field": "parent" }
  ]
}

Answer 2 · 2022-01-20T10:12:37.000Z

Global ordinals

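The join field uses global ordinals to speed up joins. Global ordinals need to be rebuilt after any change to a shard. The more parent id values a shard stores, the longer it takes to rebuild the global ordinals for the join field.

Global ordinals are built eagerly by default: if the index has changed, global ordinals for the join field are rebuilt as part of the refresh. This can add significant time to the refresh. However, most of the time this is the right trade-off; otherwise global ordinals are rebuilt when the first parent-join query or aggregation is used. That can introduce a significant latency spike for your users, and it is usually worse because, when many writes are occurring, several rebuilds of the join field's global ordinals may be attempted within a single refresh interval.

If the join field is used infrequently and writes occur frequently, it may make sense to disable eager loading: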

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
           "question": "answer"
        },
        "eager_global_ordinals": false
      }
    }
  }
}

The amount of heap used by global ordinals can be checked per parent relation as follows:

# Per-index
GET _stats/fielddata?human&fields=my_join_field#question

# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=my_join_field#question
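
One practical note when calling these endpoints from code instead of the console: `#` starts a URL fragment, so the field name generally has to be percent-encoded. A small Python sketch using only the standard library; `fielddata_stats_path` is a hypothetical helper, not a client API.

```python
from urllib.parse import quote


def fielddata_stats_path(join_field, parent, per_node=False):
    """Build the fielddata stats path for one parent relation."""
    # safe="" makes quote() encode "#" as %23 so it is not read as a
    # URL fragment by HTTP clients.
    field = quote(f"{join_field}#{parent}", safe="")
    base = "/_nodes/stats/indices/fielddata" if per_node else "/_stats/fielddata"
    return f"{base}?human&fields={field}"


print(fielddata_stats_path("my_join_field", "question"))
# /_stats/fielddata?human&fields=my_join_field%23question
print(fielddata_stats_path("my_join_field", "question", per_node=True))
# /_nodes/stats/indices/fielddata?human&fields=my_join_field%23question
```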

Multiple children per parent

It is also possible to define multiple children for a single parent:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"]  # `question` is the parent of `answer` and `comment`
        }
      }
    }
  }
}

Multiple levels of parent join

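Using multiple levels of relations to replicate a relational model is not recommended. Each level of relation adds overhead at query time in terms of memory and computation. If you care about performance, you should de-normalize your data.

Multiple levels of parent/child: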

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"],  # `question` is the parent of `answer` and `comment`
          "answer": "vote" # `answer` is the parent of `vote`
        }
      }
    }
  }
}

The mapping above represents the following tree:

   question
    /    \
   /      \
comment  answer
           |
           |
          vote

Indexing a grandchild document requires a routing value equal to the grand-parent (the topmost parent of the lineage):

PUT my-index-000001/_doc/3?routing=1&refresh # this child document must be on the same shard as its grand-parent and parent
{
  "text": "This is a vote",
  "my_join_field": {
    "name": "vote",
    "parent": "2" # the parent id of this document (must point to an `answer` document)
  }
}
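
To make the routing rule for deeper lineages concrete, here is a minimal Python sketch that walks up from a document to its topmost ancestor and returns that id as the routing value. `routing_for` and the `parent_ids` dictionary are hypothetical (the application would have to track child-to-parent ids itself), and the sketch assumes the root document was indexed with default routing, i.e. its own id.

```python
def routing_for(doc_id, parent_ids):
    """Return the id of the topmost ancestor of doc_id.

    parent_ids maps a document id to its parent's id; root documents
    do not appear as keys.
    """
    current = doc_id
    seen = set()
    while current in parent_ids:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current!r}")
        seen.add(current)
        current = parent_ids[current]
    return current


# vote "3" -> answer "2" -> question "1", matching the example above.
parent_ids = {"3": "2", "2": "1"}
print(routing_for("3", parent_ids))  # 1
print(routing_for("1", parent_ids))  # 1 (a root routes to itself)
```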