Cannot specify csv parser
Closed this issue · 4 comments
Hi,
We are trying to use command
to upload a csv to redshift. We first run embulk guess
and then embulk run
. We try to specify a parser: csv
on the command
input.
embulk guess
ignores our parser
, and instead inserts its own.
Are we doing this wrong? Should we avoid calling embulk guess
for input type command
? Or is there some way to get the command
input type to use our parser
settings?
Hello @bcipolli
Why don't you use FileInputPlugin ex) s3
,file
?
A command
generate CSV? And do you want to guess it?
If so, here is an example.
Maybe...
parser: csv
is wrong.
Correct
parser:
type: csv
seed.yml
in:
type: command
# command: "ruby -e 'print \"a,c,c\n\"*20'"
command: cat sample.txt
parser:
type: csv
out:
type: stdout
sample.txt
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
aaa,123,bbb
embulk guess seed.yml -o config.yml
in:
type: command
command: cat sample.txt
parser:
type: csv
charset: UTF-8
newline: LF
delimiter: ','
quote: '"'
escape: '"'
trim_if_not_quoted: false
skip_header_lines: 0
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: c0, type: string}
- {name: c1, type: long}
- {name: c2, type: string}
type: stdout
Thanks for the advice @hiroyuki-sato. We will try moving to the FileInputPlugin, that does make more sense for what we're doing.
However, I am still concerned about guess
. I think I've isolated the problem. Taking your example, if sample.txt
is the following:
id,website,name
848,"www.facebook.com
",Facebook
1800,www.google.com,Google
The columns are guessed incorrectly:
in:
type: command
command: cat sample.txt
parser:
type: csv
charset: UTF-8
newline: LF
delimiter: ','
quote: null
trim_if_not_quoted: false
skip_header_lines: 0
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: c0, type: string}
- {name: c1, type: string}
- {name: c2, type: string}
out: {type: stdout}
The issue appears to be the line break in the first row of our data. If we remove the line break:
id,website,name
848,"www.facebook.com",Facebook
1800,www.google.com,Google
The columns are guessed correctly:
in:
type: command
command: cat sample.txt
parser:
type: csv
charset: UTF-8
newline: LF
delimiter: ','
quote: '"'
escape: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: website, type: string}
- {name: name, type: string}
out: {type: stdout}
I will work on our command so that our data doesn't include these line breaks, but this should be handled correctly. I'm guessing it's an issue with CSV parsing.
The guess
command cannot parse CSV completely.
(Because there are many CSV format.)
And sometimes return an unexpected result.
If you add quote: '"'
explicitly,
the guess
command output the result as you expected.
If you need more discussion about guess
command,
please create a new issue on the Embulk repo.
id,website,name
848,"www.facebook.com
",Facebook
1800,www.google.com,Google
in:
type: file
path_prefix: sample2.txt
parser:
type: csv
quote: '"' # <-- NEW
out:
type: stdout
The guess
output the blow.
in:
type: file
path_prefix: sample2.txt
parser:
type: csv
quote: '"'
charset: UTF-8
newline: LF
delimiter: ','
escape: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: website, type: string}
- {name: name, type: string}
out: {type: stdout}
I would like to close this issue.
Please feel free to reopen the issue if you need additional questions.