/embulk-input-elasticsearch

Elasticsearch input plugin for Embulk. parallel query support.

Primary LanguageRubyMIT LicenseMIT

Elasticsearch input plugin for Embulk Build Status Gem Version

Overview

  • Plugin type: input
  • Resume supported: yes
  • Cleanup supported: yes
  • Guess supported: no

Configuration

  • nodes: nodes (array, required)
    • host: host (string, required)
    • port: port (integer, required)
  • queries: lucene query array. (array, required)
  • index: index (string, required)
  • index_type: index_type (string)
  • request_timeout: request timeout (integer)
  • per_size: per size query. (integer, required, default: 1000)
  • limit_size: limit size unit query. (integer, default: unlimit)
  • num_threads: number of threads for queries. (integer, default: 1)
  • retry_on_failure: retry on failure. set 0 is retry forever. (integer, default: 5)
  • sort: sort order. (hash, default: nil)
  • scroll: scroll. to keep the search context. (string, default: '1m')
  • fields: fields (array, required)
    • name: name (string, required)
    • type: type (string, required)
    • metadata: metadata (boolean, default: false)
    • time_format: time_format (string)

Example

in:
  type: elasticsearch
  nodes:
    - {host: localhost, port: 9200}
  queries:
    - 'page_type: HP'
    - 'page_type: GP'
  index: crawl
  index_type: m_corporation_page
  request_timeout: 60
  per_size: 1000
  limit_size: 200000
  num_threads: 2
  sort:
    m_corporation_id: desc
    employee_range: asc
  fields:
    - { name: _id, type: string, metadata: true }
    - { name: _type, type: string, metadata: true }
    - { name: _index, type: string, metadata: true }
    - { name: _score, type: double, metadata: true }
    - { name: page_type, type: string }
    - { name: corp_name, type: string }
    - { name: corp_key, type: string }
    - { name: title, type: string }
    - { name: body, type: string }
    - { name: url, type: string }
    - { name: employee_range, type: long }
    - { name: m_corporation_id, type: long }
    - { name: cg_lv1, type: json }
    - { name: cg_lv2, type: json }
    - { name: cg_lv3, type: json }

Support Type

  • string
  • long
  • double
  • timestamp
  • json
  • boolean

test

setup

curl -o embulk.jar --create-dirs -L "http://dl.embulk.org/embulk-latest.jar"
chmod +x embulk.jar
./embulk.jar gem install bundler
./embulk.jar bundle install --path vendor/bundle

run test

./embulk.jar bundle exec rake test

Build

$ rake