Temporary CSV files generated have control M characters

Question

Temporary CSV files generated have control M characters

Closed this issue 5 years ago · 3 comments

We are trying to load data from Oracle to Bigquery. However, temporary CSV files that are generated contain control M characters which results in newline. Bugquery hence cannot process this csv file. Are there any filters to get rid of such cases? Any help would be appreciated.
Note: Specified tables are around 15+ GBs in size.
Sample config.yml:

in:
  type: oracle
  driver_path: /ojdbc7-12.1.0.2.jar
  url: jdbc:oracle:thin:@something.com:1526/DB
  user: user
  password: "password"
  query: "SELECT * SCHEMA.TABLE"
  fetch_rows: 4000
  connect_timeout: 100
  formatter:
    type: csv
    delimiter: ","
    newline: CR
    newline_in_field: CR
    escape: "\\"
    null_string: "\\N"
out:
   type: bigquery
   mode: replace
   auth_method: service_account
   project: project
   dataset: "dataset"
   table: "table"
   location: europe-west2
   json_keyfile: credentials.json
   allow_quoted_newlines: true
   abort_on_error: false
   delete_from_local_when_job_end: false

Answer 1 · 2020-02-25T06:25:00.000Z

Hello, @OmkarPathak

I'm not sure temporary CSV means. It seems that input data is an Oracle database.

Anyway, Have you ever tried the embulk-filter-ruby_proc plugin?
If an input data contains CR, you can remove/replace it.

This is an example.

filters:
  - type: ruby_proc
    columns:
    - name: account
      proc: |
        -> (record){ record.nil? ? nil :  record.gsub(",","") }

Change CR -> LF

filters:
  - type: ruby_proc
    columns:
    - name: account
      proc: |
        -> (record){ record.nil? ? nil :  record.gsub("\r","\n") }

Trim CR

filters:
  - type: ruby_proc
    columns:
    - name: account
      proc: |
        -> (record){ record.nil? ? nil :  record.gsub("\r","") }

Another idea is to use embulk-filter-to_json for example.
Could you read this document?
https://github.com/embulk/embulk-output-bigquery#formatter-performance-issue

Answer 2 · 2020-02-25T06:35:54.000Z

I don't think embulk-input-oracle creates temporary CSV file.
Additionally, embulk-input-oracle is a Input plugin, isn't a File input plugin. You can't use formatter: option with embulk-input-oracle. Formatter plugin only works with File input plugin.

The component that creates a CSV file is embulk-output-bigquery.
https://github.com/embulk/embulk-output-bigquery/blob/a4b25bd4b814d9f7f5efb3df0af8fc19a48f0a8f/lib/embulk/output/bigquery/file_writer.rb#L29-L34

Why don't you use source_format: NEWLINE_DELIMITED_JSON ? So that file will be formatted with JSON.
https://github.com/embulk/embulk-output-bigquery/#same-options-of-bq-command-line-tools-or-bigquery-jobs-property
Although the JSON file also may contain a newline, the process itself would succeed.

Answer 3 · 2020-02-25T08:07:07.000Z

@sakama @hiroyuki-sato thanks for such quick responses. I tried the solution provided by @sakama and it worked! Thanks a lot 😄