Temporary CSV files generated have control M characters
Closed this issue · 3 comments
We are trying to load data from Oracle to BigQuery. However, the temporary CSV files that are generated contain Control-M (carriage return) characters, which result in unexpected newlines. BigQuery therefore cannot process these CSV files. Are there any filters to get rid of such characters? Any help would be appreciated.
Note: Specified tables are around 15+ GBs in size.
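The failure mode can be reproduced outside Embulk: a carriage return embedded in a field value survives into the generated CSV, so a consumer that treats CR as a line break sees the row split in two. A minimal Ruby sketch (the row values here are hypothetical, not from the actual tables):

```ruby
require "csv"

# A hypothetical row whose second column contains an embedded
# carriage return (Ctrl-M, "\r") coming from the source database.
row = ["id-1", "line one\rline two"]

csv = CSV.generate(row_sep: "\n") { |out| out << row }

# The \r passes straight through into the serialized CSV,
# which is what trips up the downstream load.
csv.include?("\r")  # => true
```

This is why the thread below converges on either stripping/replacing the CR before output, or switching to a record format that escapes control characters.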
Sample config.yml:

in:
  type: oracle
  driver_path: /ojdbc7-12.1.0.2.jar
  url: jdbc:oracle:thin:@something.com:1526/DB
  user: user
  password: "password"
  query: "SELECT * FROM SCHEMA.TABLE"
  fetch_rows: 4000
  connect_timeout: 100
formatter:
  type: csv
  delimiter: ","
  newline: CR
  newline_in_field: CR
  escape: "\\"
  null_string: "\\N"
out:
  type: bigquery
  mode: replace
  auth_method: service_account
  project: project
  dataset: "dataset"
  table: "table"
  location: europe-west2
  json_keyfile: credentials.json
  allow_quoted_newlines: true
  abort_on_error: false
  delete_from_local_when_job_end: false
Hello, @OmkarPathak
I'm not sure what "temporary CSV" means here. It seems that the input data is an Oracle database.
Anyway, have you ever tried the embulk-filter-ruby_proc plugin? If the input data contains CR characters, you can remove or replace them.
This is an example (removing commas from a field):

filters:
  - type: ruby_proc
    columns:
      - name: account
        proc: |
          -> (record){ record.nil? ? nil : record.gsub(",","") }
Change CR -> LF:

filters:
  - type: ruby_proc
    columns:
      - name: account
        proc: |
          -> (record){ record.nil? ? nil : record.gsub("\r","\n") }
Trim CR:

filters:
  - type: ruby_proc
    columns:
      - name: account
        proc: |
          -> (record){ record.nil? ? nil : record.gsub("\r","") }
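The gsub-based procs above can be checked outside Embulk as plain Ruby lambdas. A minimal sketch (the record values are hypothetical):

```ruby
# Same lambda shape the ruby_proc filter evaluates: nil-safe,
# replacing each carriage return with a line feed.
cr_to_lf = ->(record) { record.nil? ? nil : record.gsub("\r", "\n") }

# Variant that strips carriage returns entirely.
trim_cr = ->(record) { record.nil? ? nil : record.gsub("\r", "") }

cr_to_lf.call("line one\rline two")  # => "line one\nline two"
trim_cr.call("line one\rline two")   # => "line oneline two"
cr_to_lf.call(nil)                   # => nil
```

The nil guard matters because NULL column values reach the proc as nil, and calling gsub on nil would raise a NoMethodError and abort the transfer.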
Another idea is to use embulk-filter-to_json, for example.
Could you read this document?
https://github.com/embulk/embulk-output-bigquery#formatter-performance-issue
I don't think embulk-input-oracle creates a temporary CSV file.
Additionally, embulk-input-oracle is an Input plugin, not a File input plugin. You can't use the formatter: option with embulk-input-oracle; a Formatter plugin only works with a File input plugin.
The component that creates the CSV file is embulk-output-bigquery.
https://github.com/embulk/embulk-output-bigquery/blob/a4b25bd4b814d9f7f5efb3df0af8fc19a48f0a8f/lib/embulk/output/bigquery/file_writer.rb#L29-L34
Why don't you use source_format: NEWLINE_DELIMITED_JSON? That way the file will be formatted as JSON.
https://github.com/embulk/embulk-output-bigquery/#same-options-of-bq-command-line-tools-or-bigquery-jobs-property
Although the JSON file may also contain newlines in field values, the load itself should succeed.
@sakama @hiroyuki-sato thanks for such quick responses. I tried the solution provided by @sakama and it worked! Thanks a lot!