/Databuddy

Data migration tool for Oracle, SQL Server, MySQL, PostgreSQL and other databases

Primary LanguagePython

DataBuddy

Do it yourself data integration starter kit. (Treat it as a work of art rather than something useful)

##Version

Name OS Platform data-buddy (GUI) DataBuddy CLI (command line)
DataBuddy.exe Windows 32bit 0.3.5 beta [1.23.9 beta] (https://github.com/QueryCopy/QueryCopy-for-Oracle/releases/tag/v1.23.9)

There are 2 components. Upon "Run" Databuddy GUI (DataBuddy.exe) will kick off Databuddy CLI (qc32\qc.exe) out of process.

##Purpose

  • It is data integration software used to define technical processes to combine data from different sources. Data is moved across RDBMS borders using CSV files.
  • DataBuddy facilitates data delivery from multiple relational data sources including Oracle, SQLServer, DB2, SAP Sybase, Informix, MySQL, Infobright, MariaDB, PostgreSQL, TimesTen, and SQLite.
  • It requires minimal initial configuration and lets you manage data integration process using GUI or command line.
  • Your data integration processes are stored as session files and can be scripted into your ETL pipelines or used in ad-hoc manner.
  • Lets you develop Extract-Copy-Load processes to scrub and ingest large, distinct data sets from multiple sources into a unified data warehouse.
  • Provides structured and ad-hoc access to large datasets.
  • Databuddy also supports enables seamless, bi-directional integration between all major RDBMSs, such as Oracle, SQLServer, DB2, SAP Sybase, Informix, MySQL, Infobright, MariaDB, and PostgreSQL. In addition, it supports self-service data extraction, preparation, and cleansing by database developers.

##Other scripts

##Audience

Database developers, ETL developers, Data Integrators.

##Designated Environment Pre-Prod (UAT/QA/DEV)

##Databases supported

###SQL

Database GUI (DataBuddy) Command line (DataBuddy CLI)
DB2 yes yes
Informix yes yes
MariaDB yes yes
MySQL yes yes
Infobright yes yes
Oracle yes yes
PostgreSQL yes yes
SQLite yes yes
SQLServer yes yes
Sybase yes yes
TimesTen yes yes

###noSQL

Database GUI (data-buddy) Command line (DataBuddy CLI)
MongoDB yes yes

##Data Tools

Name GUI (data-buddy) Command line (DataBuddy CLI)
CURL yes yes

##File formats

###SQL

Database DDL extract CSV extract CSV load
DB2 yes yes
Informix yes yes
MariaDB yes yes
MySQL yes yes
Oracle Table yes yes
PostgreSQL yes yes
SQLite yes yes
SQLServer yes yes
Sybase yes yes
TimesTen yes yes

DDL extract works only for Oracle tables.

###noSQL

Database CSV extract CSV load JSON extract JSON load
MongoDB yes yes yes yes

###Data Tools

Database CSV extract CSV load JSON extract JSON load
CURL yes yes

##Components

  • GUI - data-buddy (wxPython, PyInstaller).
  • Command line -DataBuddy CLI (Python, PyInstaller).

##Features:

###Front end, GUI (data-buddy).

  • Session management.

###Command line (DataBuddy CLI):

  • Multi-query load.
  • Partition/sub-partition copy
  • Sharded copy (turbo mode)
  • Custom spool location (config/user_conf.py)
  • config/include/oracle.py - configurable SQL*Loader args.
  • 3 generic arguments (use them to pass job_id or timestamp and process in config/user_config.py)
  • added all usecases
  • lame_duck/limit fix for trial runs
  • keep_data_file param (set it to 1 if you want to keep data dump)
  • White-space control.
  • Header line control.
  • Truncate target table/partition/subpartition
  • Ask to truncate.
  • No client (url) connect.
  • Supports CSV file load from multiple dirs.
  • --exit_on_key - let's you keep exec window open after load job is done
  • file download using curl.exe

Pipeline types

  • E Extract
  • L Load
  • C Copy

###SQL

Database Table Partition Subpartition CSV Files Queries
DB2 E/L/C L/E E/C
Informix E/L/C L/E E/C
MariaDB E/L/C L/E E/C
MySQL E/L/C L/E E/C
Oracle E/L/C E/L/C E/L/C L/E E/C
PostgreSQL E/L/C E/L/C E/L/C L/E E/C
SQLite E/L/C L/E E/C
SQLServer E/L/C E/L/C L/E E/C
Sybase E/L/C L/E E/C
TimesTen E/L/C L/E E/C
###noSQL
Database Collection CSV Files JSON Files Queries
--------- -------------- ----- ----- ----- ---
MongoDB E/L/C E/L E/L E/C

##Tools used to extract, load, and query data

  1. DbShell - queries target and source for table metadata.
  2. Spooler - extracts data to temp file from source.
  3. Loader - loads temp file to target using bulk loader.

####SQL stores.

DB family Database Spooler Loader DbShell
DB2 DB2 Advanced Enterprise Server db2.exe db2.exe db2.exe
DB2 Advanced Workgroup Server db2.exe db2.exe db2.exe
DB2 Developer Edition db2.exe db2.exe db2.exe
DB2 Express db2.exe db2.exe db2.exe
DB2 Express C db2.exe db2.exe db2.exe
DB2 Enterprise Server db2.exe db2.exe db2.exe
DB2 Workgroup Server db2.exe db2.exe db2.exe
MySQL MySQL mysql.exe mysql.exe mysql.exe
Infobright mysql.exe mysql.exe mysql.exe
MariaDB mysql.exe mysql.exe mysql.exe
Informix Informix IDS dbaccess.exe dbaccess.exe dbaccess.exe
Informix Innovator C dbaccess.exe dbaccess.exe dbaccess.exe
Oracle Oracle 12c sqlplus.exe sqlldr.exe sqlplus.exe
Oracle 11g sqlplus.exe sqlldr.exe sqlplus.exe
Exadata sqlplus.exe sqlldr.exe sqlplus.exe
Oracle XE sqlplus.exe sqlldr.exe sqlplus.exe
PostgreSQL PostgreSQL psql.exe psql.exe psql.exe
SQL Server SQL Server Enterprise sqlcmd.exe sqlcmd.exe sqlcmd.exe
SQL Server Express sqlcmd.exe sqlcmd.exe sqlcmd.exe
SAP Sybase Sybase SQL Anywhere dbisql.com dbisql.com dbisql.com
SAP Sybase ASE dbisql.com dbisql.com dbisql.com
Sybase IQ dbisql.com dbisql.com dbisql.com
TimesTen TimesTen ttBulkCp.exe ttBulkCp.exe ttIsql.exe
SQLite SQLite sqlite3.exe sqlite3.exe sqlite3.exe

####noSQL stores.

Database Spooler Loader DbShell
MongoDB mongoexport.exe mongoimport.exe mongo.exe

####Data Tools.

Database Spooler Loader DbShell
CURL curl.exe n/a n/a

What it doesn't do

  • It does not create target table.
  • It does not pipe data (it extracts into a file then loads).
  • It should not be used in Prod. Trial/ad-hoc use only.

##Implementation

  • Written using Python (command line) and wxPython (GUI).
  • Compiled with PyInstaller

##Configuration

  • modify default host_map.py: host_map.py
  • set your 'source' and 'target' dirs for local clients for each database.
 'host_list': {0: {'db_env': {'ORA11G': {'source': 'C:\\app\\alex_buz\\product\\11.2.0\\dbhome_2\\BIN',
                                         'target': 'C:\\app\\alex_buz\\product\\11.2.0\\dbhome_2\\BIN'},
  • you are good to go

##Execution

  • data-buddy.exe

from Windows command line or File Explorer

#Target object Truncate

  • it will popup with warning window every time you try to run DataBuddy CLI truncating you target object. truncate

#Templates v.s. free argument entry ##Pros

  • all templates are tested with presets

##Cons

  • user has to create new session if new argument has to be added/removed

There's no way to add/remove args to your choosing. Argument combos come as templates which you select in "Create new session window"

#TODO

##Limitations

  • tested to run only on Windows for now (even thou it's wxPython)
  • CSV dump files are uncompressed (will add zip compression as option)
  • physical copy is done on Windows. Only Oracle copy can be executed on Linux (bash via ssh)

##Performance

  • data copy speed mostly depends on your NIC(Ethernet) speed and other factors like how far you are from target and source servers (in terms of network topology and physically).

I've seen 10x performance improvement when I ran it on DEV Linux server (10Gb Ethernet) v.s. my office Windows Desktop (100Mb Ethernet).

##References

##Vendor Docs

Database Version Export Import DbShell File formats
Oracle 12c 12c Release 1 SQL*Plus SQL*Loader CSV, DDL
Oracle 11g 11g Release 2
Disco
Crate Data
MongoDB 3.0 mongoexport mongoimport mongo shell CSV, JSON
TimesTen Release 2
Amazon DynamoDB
Hadoop r2.7.0
EXASOL 5.0.3 pdf
Kinesis
Netezza 7.2.0
KDB+
Aster
Aerospike
Couchbase
Redshift
Enterprise DB
Foundation DB
Arango DB
Volt DB
OrientDB
Clustrix
NuiDB
McObject
Actian Ingres
Greenplum
Salesforce DataLoader
Twitter Streaming API
Google Analytics
Vertica 7-1-x
SQLServer
MySQL
PostgreSQL
SQLite
Informix 12.1.0
SAP ASE Sybase
DB2 10.5.0
dBase
MS Access 10.0
Pandas
Apache Spark
Apache Storm
MariaDB
Infobright
[Cassandra] (http://cassandra.apache.org/)
[HBase] (http://hbase.apache.org/)
[Teradata] (http://www.info.teradata.com/HTMLPubs/DB_TTU_14_00/index.html#page/Storage_Management/B035_2492_071A/2492ch01.084.17.html)
[Bigtable] (https://cloud.google.com/bigtable/docs/)
[Apache Hive] (http://doc.mapr.com/display/MapR/Hive)
InterSystems Caché export
Altibase HDB
ZooKeeper
HDFS
Amazon SimpleDB
Amazon Redshift
Azure DocumentDB
Elasticsearch
Riak
MemSQL
MarkLogic
SQream DB
[1010data] (https://www.1010data.com/technology)
Kognitio
SAND Technology
Redis
ZODB
Amazon Simple Storage Service/S3
Amazon Aurora
RethinkDB
[Apache Derby] (https://db.apache.org/derby/)
InfiniDB
SQL Data Warehouse

Vendor architecture

[MongoDB Architecture Guide] (http://s3.amazonaws.com/info-mongodb-com/MongoDB_Architecture_Guide.pdf)

Data Tools Version
[cURL] (http://curl.haxx.se/download.html) 7.33

Screenshots

  • New Session Menu:

New Session Menu

  • Add New Session:

Add New Session

  • Run Session:

Run Session

  • Executed Session:

Executed Session

  • Output Tab:

Output Tab

##Copy vector matrix (via CSV)

Database DB2 Informix MariaDB MySQL Oracle PostgreSQL SQLite SQLServer Sybase TimesTen MongoDB
DB2 x x x x x x x x x x x
Informix x x x x x x x x x x x
MariaDB x x x x x x x x x x x
MySQL x x x x x x x x x x x
Oracle x x x x x x x x x x x
PostgreSQL x x x x x x x x x x x
SQLite x x x x x x x x x x x
SQLServer x x x x x x x x x x x
Sybase x x x x x x x x x x x
TimesTen x x x x x x x x x x x
Mongo DB x x x x X x x x x x x

'x' means feature is implemented and is part of the release 0.3.3

##Copy vector matrix (via JSON)

Database MongoDB
DB2 x
Informix x
MariaDB x
MySQL x
Oracle x
PostgreSQL x
SQLite x
SQLServer x
Sybase x
TimesTen x
Mongo DB x

##Databuddy Docs

##Download

#FAQ

Can it load CSV file from Windows desktop to Oracle database.

Yes, it is the main purpose of this tool.

Can developers integrate CSV loader into their ETL pipelines?

Yes, assuming they are doing it on OS Windows.

How fast is data upload using Databuddy?

It is as fast as SQLLoader because I'm using SQLLoader in DIRECT mode.

####How to inscrease upload speed? Parallelize your load by loading multiple partitions in parallel or sharding your data.

What are the other ways to upload file to Oracle?

You can use SQLLoader or execute insert statements using ODBC/JDBC or SQLPlus.

Can I just zip it using Windows File Explorer?

No, Redshift will not recognize *.zip file format. You have to gzip it. You can use 7-Zip to do that.

Does it delete source file load?

No

Does it create target Oracle table?

No

I'm experiencing errors in Oracle. How can I debug?

You can set Common Args debug_level to a value>1

Explain how it works?

  • The CSV file you provided is analyzed - number of records in it is estimated.
  • Control file is created using target table column names.
  • Data is loaded using SQL*Loader

How do I skip the header record in input CSV file?

Use --skip_rows 1 to ignore input rows.

How do i set custom timestamp format for Oracle load?

Use --nls_date_format "MM/DD/YYYY HH12:MI:SS" or --nls_timestamp_format "MM/DD/YYYY HH12:MI:SS.FF2 PM" or --nls_timestamp_tz_format "MM/DD/YYYY HH12:MI:SS.FF2 TZH:TZM" to control timestamp format.

What technology was used to create this tool

I used SQL*Loader and wxPython to write it.

Where are the sources?

Please, contact me for sources.

Can you modify functionality and add features?

Yes, please, ask me for new features.

What other AWS tools you've created?

Do you have any AWS Certifications?

Yes, AWS Certified Developer (Associate)

Can you create similar/custom data tool for our business?

Yes, you can PM me here or email at alex_buz@yahoo.com. I'll get back to you within hours.

###Links