hiroyuki-sato/embulk-parser-jsonpath

Can't use guess_sample_buffer_bytes?

Closed this issue · 3 comments


  • Embulk v0.8.31

Gemfile

source 'https://rubygems.org'

# for input json
gem 'embulk-parser-jsonpath', '~> 0.2.0'

I tried the following config with guess, but it did not work:

exec:
  guess_sample_buffer_bytes: 136192
in:
  type: file
  path_prefix: tmp/
out:
  type: stdout

$ embulk guess -g jsonpath config.yml.liquid -o guess.yml

2017-09-06 23:15:36.677 +0900: Embulk v0.8.31
2017-09-06 23:16:00.086 +0900 [INFO] (0001:guess): Listing local files at directory 'tmp' filtering filename by prefix ''
2017-09-06 23:16:00.087 +0900 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-09-06 23:16:00.092 +0900 [INFO] (0001:guess): Loading files [tmp/test.json]
2017-09-06 23:16:00.099 +0900 [INFO] (0001:guess): Try to read 136,192 bytes from input source
2017-09-06 23:16:00.145 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/gzip from a load path
2017-09-06 23:16:00.157 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/bzip2 from a load path
2017-09-06 23:16:00.169 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/json from a load path
2017-09-06 23:16:00.174 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/csv from a load path
2017-09-06 23:16:00.229 +0900 [INFO] (0001:guess): Loaded plugin embulk-parser-jsonpath (0.2.0)
org.jruby.exceptions.RaiseException: (null) unexpected token at '{
    "date": "2017-08-06",
    "clicks": 0,
    "ctr": 0,
    "impressions": 2,
    "keyword": "hogehoge",
  '
	at RUBY.load(/Users/kieaiaarh/.embulk/jruby/2.3.0/gems/multi_json-1.12.2/lib/multi_json.rb:124)
	at RUBY.process_object(/Users/kieaiaarh/.embulk/jruby/2.3.0/gems/jsonpath-0.5.8/lib/jsonpath.rb:87)
	at RUBY.enum_on(/Users/kieaiaarh/.embulk/jruby/2.3.0/gems/jsonpath-0.5.8/lib/jsonpath.rb:73)
	at RUBY.on(/Users/kieaiaarh/.embulk/jruby/2.3.0/gems/jsonpath-0.5.8/lib/jsonpath.rb:65)
	at RUBY.guess_text(/Users/kieaiaarh/.embulk/jruby/2.3.0/gems/embulk-parser-jsonpath-0.2.0/lib/embulk/guess/jsonpath.rb:12)
	at RUBY.guess(uri:classloader:/embulk/guess_plugin.rb:78)
	at RUBY.guess(uri:classloader:/embulk/guess_plugin.rb:24)
:

Because the input JSON data is too large, I cut the file down to 32K:

$ ls -la tmp
-rw-r--r--  1 kieaiaarh  staff    32K  9  6 23:18 tmp/test.json

and it worked!

$ embulk guess -g jsonpath config.yml.liquid -o guess.yml
2017-09-06 23:20:23.739 +0900: Embulk v0.8.31
2017-09-06 23:20:46.922 +0900 [INFO] (0001:guess): Listing local files at directory 'tmp' filtering filename by prefix ''
2017-09-06 23:20:46.923 +0900 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-09-06 23:20:46.928 +0900 [INFO] (0001:guess): Loading files [tmp/test.json]
2017-09-06 23:20:46.935 +0900 [INFO] (0001:guess): Try to read 136,192 bytes from input source
2017-09-06 23:20:46.981 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/gzip from a load path
2017-09-06 23:20:46.991 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/bzip2 from a load path
2017-09-06 23:20:47.003 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/json from a load path
2017-09-06 23:20:47.008 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/csv from a load path
2017-09-06 23:20:47.061 +0900 [INFO] (0001:guess): Loaded plugin embulk-parser-jsonpath (0.2.0)
exec: {guess_sample_buffer_bytes: 136192}
in:
  type: file
  path_prefix: tmp/
  parser:
    charset: UTF-8
    newline: LF
    type: jsonpath
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 2
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: date, type: timestamp, format: '%Y-%m-%d'}
    - {name: clicks, type: long}
    - {name: ctr, type: double}
    - {name: impressions, type: long}
    - {name: keyword, type: string}
    - {name: position, type: double}
out: {type: stdout}

Created 'guess.yml' file.
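
One possible reading of the two runs above, which is an assumption and not something confirmed in this thread: the guess step only saw the leading bytes of the file, so the sample ended partway through the JSON document and the parser hit an unexpected token. A file small enough to fit in the sample parses fine. The sketch below reproduces that effect in plain Python. The `sample` and `parses` helpers, the 32 KiB limit, and the row values (including `position`) are hypothetical and only illustrate the idea; none of this is Embulk code.

```python
import json


def sample(data: bytes, limit: int) -> bytes:
    """Return at most `limit` leading bytes (hypothetical stand-in for a guess sample)."""
    return data[:limit]


def parses(buf: bytes) -> bool:
    """Return True if the buffer is a complete, valid JSON document."""
    try:
        json.loads(buf.decode("utf-8", errors="ignore"))
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical rows shaped like the object shown in the error message.
row = {
    "date": "2017-08-06",
    "clicks": 0,
    "ctr": 0,
    "impressions": 2,
    "keyword": "hogehoge",
    "position": 1.0,  # assumed value, not taken from the issue
}

LIMIT = 32 * 1024  # assumed sample size, for illustration only

# A large document: the sample is cut off mid-array, so parsing fails.
document = json.dumps([row] * 1000, indent=4).encode("utf-8")
print(len(document) > LIMIT, parses(sample(document, LIMIT)))

# A small document: the whole file fits inside the sample, so parsing succeeds.
small = json.dumps([row] * 10, indent=4).encode("utf-8")
print(len(small) <= LIMIT, parses(sample(small, LIMIT)))
```

If this reading is right, the question is why the larger guess_sample_buffer_bytes value did not give the plugin a bigger sample.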

But I don't understand why.
According to embulk/embulk#609, it seems that guess_sample_buffer_bytes can be set in the guess config (YAML).

Thanks for your help.

@kieaiaarh Thank you for reporting this issue.
I'm investigating this issue.

@kieaiaarh
The cause seems to be in embulk-core.
I reported it as embulk/embulk#788.

I'll let you know when the issue is fixed.

embulk/embulk#788 fixed this issue.