Cannot specify csv parser
bcipolli opened this issue · 4 comments
Hi,
We are trying to use the command input to upload a CSV to Redshift. We first run embulk guess and then embulk run. We try to specify parser: csv on the command input.
embulk guess ignores our parser, and instead inserts its own.
Are we doing this wrong? Should we avoid calling embulk guess for input type command? Or is there some way to get the command input type to use our parser settings?
Hello @bcipolli
Why don't you use a file input plugin (e.g. s3 or file)?
Does a command generate the CSV, and do you want to guess its format?
If so, here is an example.
Maybe parser: csv is wrong. The correct form is:
parser:
  type: csv
seed.yml
in:
  type: command
  # command: "ruby -e 'print \"a,c,c\n\"*20'"
  command: cat sample.txt
  parser:
    type: csv
out:
  type: stdout
sample.txt
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
embulk guess seed.yml -o config.yml
in:
  type: command
  command: cat sample.txt
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 0
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: c0, type: string}
    - {name: c1, type: long}
    - {name: c2, type: string}
out:
  type: stdout
Thanks for the advice @hiroyuki-sato. We will try moving to the FileInputPlugin; that does make more sense for what we're doing.
However, I am still concerned about guess. I think I've isolated the problem. Taking your example, if sample.txt is the following:
id,website,name
848,"www.facebook.com
",Facebook
1800,www.google.com,Google
The columns are guessed incorrectly:
in:
  type: command
  command: cat sample.txt
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: ','
    quote: null
    trim_if_not_quoted: false
    skip_header_lines: 0
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: c0, type: string}
    - {name: c1, type: string}
    - {name: c2, type: string}
out: {type: stdout}
The issue appears to be the line break in the first row of our data. If we remove the line break:
id,website,name
848,"www.facebook.com",Facebook
1800,www.google.com,Google
The columns are guessed correctly:
in:
  type: command
  command: cat sample.txt
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: website, type: string}
    - {name: name, type: string}
out: {type: stdout}
I will work on our command so that our data doesn't include these line breaks, but this should be handled correctly. I'm guessing it's an issue with CSV parsing.
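As a rough illustration of the workaround mentioned above (making the command's output free of embedded line breaks), here is a minimal Python sketch. It is not part of Embulk; the function name flatten_embedded_newlines is hypothetical, and it assumes the data is comma-delimited with double-quote quoting, as in the sample above.

```python
import csv
import io


def flatten_embedded_newlines(text: str) -> str:
    """Re-emit CSV text with line breaks inside quoted fields removed.

    Assumes comma-delimited input that uses '"' as the quote character.
    """
    # newline="" keeps the csv module in charge of line-ending handling.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Drop CR/LF characters found inside any field of the record.
        writer.writerow([f.replace("\r", "").replace("\n", "") for f in row])
    return out.getvalue()


sample = (
    "id,website,name\n"
    '848,"www.facebook.com\n",Facebook\n'
    "1800,www.google.com,Google\n"
)

print(flatten_embedded_newlines(sample), end="")
# id,website,name
# 848,www.facebook.com,Facebook
# 1800,www.google.com,Google
```

A script like this could sit at the end of the command pipeline, so that each record occupies exactly one line before the data reaches embulk guess.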
The guess command cannot parse every CSV file perfectly, because there are many CSV formats, and it sometimes returns an unexpected result.
If you add quote: '"' explicitly, the guess command outputs the result you expected.
If you need more discussion about the guess command, please create a new issue on the Embulk repo.
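To illustrate why the quote character matters for this kind of data, here is a small Python sketch. It does not reproduce how embulk guess works internally; it only uses Python's standard csv module as a stand-in to compare a quote-unaware, line-by-line reading with a quote-aware one. The helper names are hypothetical.

```python
import csv
import io

sample = (
    "id,website,name\n"
    '848,"www.facebook.com\n",Facebook\n'
    "1800,www.google.com,Google\n"
)


def naive_field_counts(text: str) -> list[int]:
    """Count fields per physical line, ignoring quoting entirely."""
    return [len(line.split(",")) for line in text.splitlines()]


def quoted_field_counts(text: str) -> list[int]:
    """Count fields per record, treating '"' as the quote character."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    return [len(row) for row in reader]


print(naive_field_counts(sample))   # [3, 2, 2, 3]
print(quoted_field_counts(sample))  # [3, 3, 3]
```

Without quote handling, the line break inside the quoted field splits one record into two short lines, so the rows no longer have a consistent number of columns. With the quote character known up front, every record has three fields.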
id,website,name
848,"www.facebook.com
",Facebook
1800,www.google.com,Google
in:
  type: file
  path_prefix: sample2.txt
  parser:
    type: csv
    quote: '"' # <-- NEW
out:
  type: stdout
The guess command outputs the following:
in:
  type: file
  path_prefix: sample2.txt
  parser:
    type: csv
    quote: '"'
    charset: UTF-8
    newline: LF
    delimiter: ','
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: website, type: string}
    - {name: name, type: string}
out: {type: stdout}
I would like to close this issue.
Please feel free to reopen the issue if you have additional questions.