embulk-input-bigquery_extract_files

Extract a Google Cloud BigQuery table or query result to Google Cloud Storage and download it.


Google Cloud BigQuery extract file input plugin for Embulk

An Embulk file input plugin that reads files stored in Google Cloud Storage which were extracted from a Google Cloud BigQuery table or query result.

Overview

  • Plugin type: file input
  • Resume supported: no
  • Cleanup supported: yes

Detail

Reads files stored in Google Cloud Storage that were exported from a Google Cloud BigQuery table or query result.

This can be a useful solution for very large data in BigQuery.

If you set the table option without the query option, the plugin simply extracts the table to Google Cloud Storage.

If you set the query option, the query result is saved to a temporary table, and that temporary table is then extracted to the Google Cloud Storage URI. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
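For example, a minimal table-only configuration might look like the sketch below. The project, key file, dataset, table, bucket, and local path values are placeholders, not real resources; the option names are the same ones used in the examples that follow.

in:
  type: bigquery_extract_files
  project: my-gcp-project                          # placeholder project id
  json_keyfile: /path/to/service-account-key.json  # placeholder key file path
  dataset: my_dataset
  table: my_table                                  # table mode: no query option set
  gcs_uri: gs://my-bucket/extract/my_table_*
  temp_local_path: /tmp/embulk/data
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}
  parser:
    type: json
out:
  type: stdout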

Usage

Install plugin

embulk gem install embulk-input-bigquery_extract_files

Configuration

Example

in:
  type: bigquery_extract_files
  project: googlecloudplatformproject
  json_keyfile: gcp-service-account-private-key.json
  dataset: target_dataset
  #table: target_table
  query: 'select a,b,c from target_table'
  gcs_uri: gs://bucket/subdir
  temp_dataset: temp_dataset
  temp_local_path: C:\Temp
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}  
  parser:
    type: json
out: 
  type: stdout
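Assuming the example above is saved as config.yml, you can run it with the standard Embulk command:

embulk run config.yml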

Advanced Example

BigQuery to MySQL with auto-schema

For my job, I have to batch a BigQuery table into MySQL every day, so I wanted auto-schema support from this file input plugin.

This is my best practice for a BigQuery-to-MySQL batch config.

in:
  type: bigquery_extract_files
  project: my-google-project
  json_keyfile: /tmp/embulk/google_service_account.json
  query: 'select * from dataset.t_nitocris'
  temp_dataset: temp_dataset
  gcs_uri: gs://bucket/embulktemp/t_nitocris_*
  temp_local_path: /tmp/embulk/data
  file_format: 'CSV'
  compression: 'GZIP'
  temp_schema_file_path: /tmp/embulk/schema/csv_schema_nitocris.json
  decoders:
  - {type: gzip}
  parser:
    type: csv_with_schema_file
    default_timestamp_format: '%Y-%m-%d %H:%M:%S %z'
    schema_path: /tmp/embulk/schema/csv_schema_nitocris.json
out:
  type: mysql
  host: host
  user: user
  password: password
  port: 3306
  database: MY_DATABASE
  table: 
  options: {connectTimeout: 0, waitTimeout: 0, enableQueryTimeouts: false, autoReconnect: true}
  mode: insert_direct
  retry_limit: 60
  retry_wait: 3000
  batch_size: 4096000
  • BigQuery's TIMESTAMP values are not always exported in exactly one format within a single CSV export. You can define the optional 'columns' param in the 'csv_with_schema_file' parser to handle another format, as shown below.
...
  parser:
    type: csv_with_schema_file
    default_timestamp_format: '%Y-%m-%d %H:%M:%S %z'
    schema_path: /tmp/embulk/schema/csv_schema_nitocris.json
    columns:
      - {name: Date2, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N %z'}

Build

$ ./gradlew gem  # add -t to watch for file changes and rebuild continuously

Another choice

This plugin is useful as a file input type, but it may be complicated to use.

For more portable use, see embulk-input-bigquery: https://github.com/medjed/embulk-input-bigquery