dbcp

$ dbcp --help
cli.js [inputFile] [outputFile]
$ dbcp data.parquet data.jsonl.gz
$ dbcp data.jsonl.gz s3://bucket/data.csv.gz

Copy from or to MySQL, PostgreSQL, SQLServer, LevelDB, MongoDB, and ElasticSearch directly to/from files on Amazon Web Services (AWS) S3, Google Cloud Storage (GCS), Microsoft Azure, SMB, HTTP, or another database.

Automatically converts between supported formats JSON, ND-JSON, CSV, SQL, Parquet, and TFRecord (with optional gzip compression).

CLI

Either --inputType or --inputFile and --outputType or --outputFile are required. Other options can be shortened, e.g --user instead of --inputUser. Only a database-to-database copy requires both --inputUser and --outputUser. The file format and compression is inferred from the filename. dbcp pipes Readable Node.JS streams to Writable streams. No intermediate storage is required.

API

The transformObject/transformObjectStream API can be used for streaming transforms of Big Data.
The CLI uses transformBytes/transformBytesStream to render progress updates.

Features

dbcp supports sharding. It can split or join groups of files.
dbcp can convert files from one format to another.
dbcp supports compound inserts, which can insert groups of associated rows from multiple tables.
dbcp can translate SQL dialects, e.g. dump a Postgres table to .sql file with SQLServer CREATE and INSERT syntax.

Credits

Database powered by knex and knex-schema-inspector
File system, file formats, and sharding provided by @wholebuzz/fs
External file sorting with external-sorting
Connected using the tree-stream primitives ReadableStreamTree and WritableStreamTree
Used to implement @wholebuzz/mapreduce

Modules

Setup

Global install

$ npm install -g dbcp
$ dbcp --help

Local setup

$ npm init
$ npm install dbcp
$ ./node_modules/.bin/dbcp --help

Examples

API

Write object stream to any source and format

  import { AnyFileSystem } from '@wholebuzz/fs/lib/fs'
  import { LocalFileSystem } from '@wholebuzz/fs/lib/local'
  import { S3FileSystem } from '@wholebuzz/fs/lib/s3'
  import { dbcp } from 'dbcp'
  import StreamTree from 'tree-stream'
  
  const fileSystem = new AnyFileSystem([
    { urlPrefix: 's3://', fs: new S3FileSystem() },
    { urlPrefix: '', fs: new LocalFileSystem() }
  ])
  
  await dbcp({
    fileSystem,
    outputFile: 's3://foo/bar.jsonl',
    // e.g. from level (https://www.npmjs.com/package/level)) database
    inputStream: StreamTree.readable(levelIteratorStream(leveldb.iterator())),
  })

Read object stream from any source and format

  import { openNullWritable } from '@wholebuzz/fs/lib/stream'
  import { dbcp } from 'dbcp'
  import { Transform } from 'stream'

  // Supply transformObject and a do-nothing Writable for outputStream.
  await dbcp({
    fileSystem,
    inputFiles: [ { url: '/tmp/foobar.csv.gz' } ],
    outputStream: [ openNullWritable() ],
    transformObject: (x) => { console.log('test', x) },
  })

  // Or alternatively supply outputStream with outputFormat = object
  await dbcp({
    fileSystem,
    inputFiles: [ { url: '/tmp/foobar.csv.gz' } ],
    // Without outputFormat = object, transform() would receive Buffer
    outputFormat: DatabaseCopyFormat.object,
    outputStream: [
      StreamTree.writable(new Transform({
        objectMode: true,
        transform(data, _, cb) {
          console.log('test', data)
          cb()
        },
      }))
    ],
  })

CLI

Copy PostgreSQL table to Google Cloud Storage gzipped JSON file
Copy MySQL table to Amazon Web Services S3 gzipped JSON-Lines file
Copy Amazon Web Services S3 gzipped JSON-Lines to MySQL table
Copy MongoDB table to four gzipped JSON-Lines shards
Copy SQLServer table to stdout
Output a file or database to stdout
Copy a file from AWS to GCP
Convert file from ND-JSON to JSON
Download a file
Post a file to HTTP endpoint
Create Athena DDL from JSON sample
Create Postgres CREATE TABLE from JSON sample
Split the test data file into four shards
Join the split files back into one

Tested

PASS src/index.test.ts (85.9 s)
  ✓ Should hash test data as string
  ✓ Should hash test data stream
  ✓ Should copy local file 
  ✓ Should read local directory
  ✓ Should convert to JSON from ND-JSON and back
  ✓ Should convert to sharded JSON from ND-JSON and back
  ✓ Should convert to Parquet from ND-JSON and back
  ✓ Should convert to TFRecord from ND-JSON and back
  ✓ Should load to level from ND-JSON and dump to JSON after external sort
  ✓ Should restore to and dump compound data
  ✓ Should restore to and dump from Elastic Search to ND-JSON
  ✓ Should restore to and dump from MongoDB to ND-JSON
  ✓ Should restore to and dump from Postgres to ND-JSON
  ✓ Should restore to and dump from Postgres to SQL
  ✓ Should not hang on error
  ✓ Should copy from Postgres to Mysql
  ✓ Should copy from Postgres to SQL Server
  ✓ Should dump from Postgres to Parquet file
  ✓ Should dump from MySQL to Parquet file
  ✓ Should dump from SQL Server to Parquet file

API Interface

export async function dbcp(args: DatabaseCopyOptions)

export interface DatabaseCopyOptions {
  batchSize?: number
  columnType?: Record<string, string>
  compoundInsert?: boolean
  contentType?: string
  copySchema?: DatabaseCopySchema
  engineOptions?: any
  externalSortBy?: string[]
  extra?: Record<string, any>
  extraOutput?: boolean
  fileSystem?: FileSystem
  group?: boolean
  groupLabels?: boolean
  limit?: number
  orderBy?: string[]
  probeBytes?: number
  query?: string
  shardBy?: string
  schema?: Column[]
  schemaFile?: string
  tempDirectories?: string[]
  transformObject?: (x: unknown) => unknown | Promise<unknown>
  transformObjectStream?: () => Duplex
  transformBytes?: (x: string) => string
  transformBytesStream?: () => Duplex
  where?: Array<string | any[]>
}

CLI Options

$ dbcp --help
cli.js [inputFile] [outputFile]

Options:
  --help            Show help                                          [boolean]
  --version         Show version number                                [boolean]
  --compoundInsert  Compound insert mode can insert associated rows from
                    multiple tables.                                   [boolean]
  --contentType     Content type                                        [string]
  --dataOnly        Dump only the data, not the schema (data definitions).
                                                                       [boolean]
  --dbname          Database                                            [string]
  --externalSortBy  Sort data by property(s) with external-sorting       [array]
  --format
    [choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
                                                                          "sql"]
  --group           Group inputs with equinvalent orderBy              [boolean]
  --host            Database host                                       [string]
  --inputFile       Input file                                           [array]
  --inputFormat
    [choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
                                                                          "sql"]
  --inputHost       Input host                                          [string]
  --inputName       Input database                                      [string]
  --inputPassword   Input database password                             [string]
  --inputPort       Input database port                                 [string]
  --inputShards     Input shards                                        [number]
  --inputTable      Input database table                                [string]
  --inputType       Input database type
          [string] [choices: "athena", "elasticsearch", "file", "http", "level",
                   "mongodb", "mssql", "mysql", "postgresql", "redis", "sqlite"]
  --inputUser       Input database user                                 [string]
  --limit           Database query LIMIT                                [number]
  --orderBy         Database query ORDER BY                              [array]
  --outputFile      Output file                                         [string]
  --outputFormat
    [choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
                                                                          "sql"]
  --outputHost      Output host                                         [string]
  --outputName      Output database                                     [string]
  --outputPassword  Output database password                            [string]
  --outputPort      Output database port                                [string]
  --outputShards    Output shards                                       [number]
  --outputTable     Output database table                               [string]
  --outputType      Output database type
          [string] [choices: "athena", "elasticsearch", "file", "http", "level",
                   "mongodb", "mssql", "mysql", "postgresql", "redis", "sqlite"]
  --outputUser      Output database user                                [string]
  --password        Database password                                   [string]
  --port            Database port                                       [string]
  --probeBytes      Probe bytes                                         [number]
  --query           Query                                               [string]
  --schemaFile      Use schema file if required, instead of schema inspection.
                                                                        [string]
  --schemaOnly      Dump only the object definitions (schema), not data.
                                                                       [boolean]
  --shardBy         Shard (or split) the data based on key              [string]
  --shards          The number of shards to split or join the data      [number]
  --table           Database table                                      [string]
  --user            Database user                                       [string]
  --where           Database query WHERE                                 [array]
  --whereDate       Database query WHERE, final argument parsed as Javascript
                    date                                                 [array]

CLI Examples

Copy PostgreSQL table to Google Cloud Storage gzipped JSON file

$ dbcp \
  --inputType postgresql \
  --host localhost \
  --dbname postgres \
  --port 5433 \
  --user postgres \
  --password postgres \
  --table foobar \
  --outputFile gs://bucket/file.json.gz

Copy MySQL table to Amazon Web Services S3 gzipped JSON-Lines file

$ dbcp \
  --inputType mysql \
  --host localhost \
  --dbname mydb \
  --port 8083 \
  --user root \
  --password wp \
  --table foobar \
  --format jsonl \
  --outputFile s3://bucket/object.jsonl.gz

Copy Amazon Web Services S3 gzipped JSON-Lines to MySQL table

$ dbcp \
  --outputType mysql \
  --host localhost \
  --dbname mydb \
  --port 8083 \
  --user root \
  --password wp \
  --table foobar \
  --inputFile s3://bucket/object.jsonl.gz

Copy SQLServer table to stdout

$ dbcp \
  --inputType mssql \
  --host localhost \
  --dbname mymsdb \
  --port 1433 \
  --user SA \
  --password "MyP@ssw0rd#" \
  --table foobar \
  --outputFile=-

Copy MongoDB table to four gzipped JSON-Lines shards

$ dbcp \
  --inputType mongodb \
  --host localhost \
  --port 27017 \
  --user root \
  --password example \
  --dbname test_db \
  --table dbcptest \
  --outputFile output-SSSS-of-NNNN.jsonl.gz \
  --outputShards 4 \
  --shardBy id

$ ls output*
-rw-r--r--    1 user    staff    782701 Feb  4 10:59 output-0001-of-0004.jsonl.gz
-rw-r--r--    1 user    staff    771980 Feb  4 10:59 output-0003-of-0004.jsonl.gz
-rw-r--r--    1 user    staff    794959 Feb  4 10:59 output-0000-of-0004.jsonl.gz
-rw-r--r--    1 user    staff    788720 Feb  4 10:59 output-0002-of-0004.jsonl.gz

Output a file to stdout

$ dbcp gs://bucket/archive.csv.gz | jq . | less

Copy a file from AWS to GCP

$ dbcp s3://bucket/object.json.gz gs://bucket/file.json.gz

Convert file from ND-JSON to JSON

$ dbcp foobar.jsonl bazbat.json

Download a file

$ dbcp "https://www.w3.org/People/mimasa/test/imgformat/img/w3c_home.png" foo.png

Post a file to HTTP endpoint

$ dbcp "./foo.png" "http://my.api/upload" --contentType "image/png"

Create Athena DDL from JSON sample:

$ dbcp --schemaOnly --inputFile ./sample.jsonl.gz --outputType athena --outputFile ddl.sql

Create Postgres CREATE TABLE from JSON sample:

$ dbcp --schemaOnly --inputFile ./sample.jsonl.gz --outputType postgresql --outputFile ddl.sql

Split the test data file into four shards:

$ dbcp ./test/test.jsonl.gz ./split-SSSS-of-NNNN.jsonl.gz --outputShards 4 --shardBy guid

Join the split files back into one:

$ dbcp ./split-SSSS-of-NNNN.jsonl.gz ./joined.jsonl.gz --inputShards 4 --orderBy id

dbcp / Exports

wholebuzz/dbcp

dbcp

CLI

API

Features

Credits

Modules

Setup

Global install

Local setup

Examples

API

Write object stream to any source and format

Read object stream from any source and format

CLI

Tested

API Interface

CLI Options

CLI Examples

Copy PostgreSQL table to Google Cloud Storage gzipped JSON file

Copy MySQL table to Amazon Web Services S3 gzipped JSON-Lines file

Copy Amazon Web Services S3 gzipped JSON-Lines to MySQL table

Copy SQLServer table to stdout

Copy MongoDB table to four gzipped JSON-Lines shards

Output a file to stdout

Copy a file from AWS to GCP

Convert file from ND-JSON to JSON

Download a file

Post a file to HTTP endpoint

Create Athena DDL from JSON sample:

Create Postgres CREATE TABLE from JSON sample:

Split the test data file into four shards:

Join the split files back into one:

dbcp

Table of contents

Modules