Cannot specify csv parser

Question

Cannot specify csv parser

Closed this issue 8 months ago · 4 comments

Hi,

We are trying to use command to upload a csv to redshift. We first run embulk guess and then embulk run. We try to specify a parser: csv on the command input.

embulk guess ignores our parser, and instead inserts its own.

Are we doing this wrong? Should we avoid calling embulk guess for input type command? Or is there some way to get the command input type to use our parser settings?

Answer 1 · 2017-06-08T00:32:50.000Z

Hello @bcipolli

Why don't you use FileInputPlugin ex) s3,file?
A command generate CSV? And do you want to guess it?
If so, here is an example.

Maybe...

parser: csv is wrong.

Correct

parser:
  type: csv

seed.yml

in:
  type: command
#  command: "ruby -e 'print \"a,c,c\n\"*20'"
  command: cat sample.txt
  parser:
    type: csv

out:
  type: stdout

sample.txt

aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb

embulk guess seed.yml -o config.yml

in:
  type: command
  command: cat sample.txt
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 0
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: c0, type: string}
    - {name: c1, type: long}
    - {name: c2, type: string}
type: stdout

Answer 2 · 2017-06-08T17:27:23.000Z

Thanks for the advice @hiroyuki-sato. We will try moving to the FileInputPlugin, that does make more sense for what we're doing.

However, I am still concerned about guess. I think I've isolated the problem. Taking your example, if sample.txt is the following:

id,website,name
848,"www.facebook.com
",Facebook
1800,www.google.com,Google

The columns are guessed incorrectly:

in:
  type: command
  command: cat sample.txt
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: ','
    quote: null
    trim_if_not_quoted: false
    skip_header_lines: 0
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: c0, type: string}
    - {name: c1, type: string}
    - {name: c2, type: string}
out: {type: stdout}

The issue appears to be the line break in the first row of our data. If we remove the line break:

id,website,name
848,"www.facebook.com",Facebook
1800,www.google.com,Google

The columns are guessed correctly:

in:
  type: command
  command: cat sample.txt
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: website, type: string}
    - {name: name, type: string}
out: {type: stdout}

I will work on our command so that our data doesn't include these line breaks, but this should be handled correctly. I'm guessing it's an issue with CSV parsing.

Answer 3 · 2017-06-09T02:43:23.000Z

@mjalkio

The guess command cannot parse CSV completely.
(Because there are many CSV format.)
And sometimes return an unexpected result.

If you add quote: '"' explicitly,
the guess command output the result as you expected.

If you need more discussion about guess command,
please create a new issue on the Embulk repo.

id,website,name
848,"www.facebook.com
",Facebook
1800,www.google.com,Google

in:
  type: file
  path_prefix: sample2.txt
  parser:
    type: csv
    quote: '"'  # <-- NEW
out:
  type: stdout

The guess output the blow.

in:
  type: file
  path_prefix: sample2.txt
  parser:
    type: csv
    quote: '"'
    charset: UTF-8
    newline: LF
    delimiter: ','
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: website, type: string}
    - {name: name, type: string}
out: {type: stdout}

Answer 4 · 2024-05-08T07:51:33.000Z

I would like to close this issue.
Please feel free to reopen the issue if you need additional questions.