A do-it-yourself data integration starter kit. (Treat it as a work of art rather than something useful.)
## Version
Name | OS | Platform | data-buddy (GUI) | DataBuddy CLI (command line) |
---|---|---|---|---|
DataBuddy.exe | Windows | 32-bit | 0.3.5 beta | [1.23.9 beta](https://github.com/QueryCopy/QueryCopy-for-Oracle/releases/tag/v1.23.9) |
There are two components. Upon "Run", the DataBuddy GUI (`DataBuddy.exe`) kicks off the DataBuddy CLI (`qc32\qc.exe`) out of process.
## Purpose
- DataBuddy is data integration software used to define technical processes that combine data from different sources. Data is moved across RDBMS borders using CSV files.
- It facilitates data delivery from multiple relational data sources, including Oracle, SQLServer, DB2, SAP Sybase, Informix, MySQL, Infobright, MariaDB, PostgreSQL, TimesTen, and SQLite.
- It requires minimal initial configuration and lets you manage data integration processes from the GUI or the command line.
- Your data integration processes are stored as session files and can be scripted into your ETL pipelines or used in an ad-hoc manner.
- It lets you develop Extract-Copy-Load processes to scrub and ingest large, distinct data sets from multiple sources into a unified data warehouse.
- It provides structured and ad-hoc access to large datasets.
- It enables seamless, bi-directional integration between all major RDBMSs, such as Oracle, SQLServer, DB2, SAP Sybase, Informix, MySQL, Infobright, MariaDB, and PostgreSQL. In addition, it supports self-service data extraction, preparation, and cleansing by database developers.
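DataBuddy drives each vendor's native spooler and bulk loader rather than running Python code like this, but the extract-to-CSV / load-from-CSV pattern it automates can be sketched in a few lines (illustrative only; sqlite3 stands in for a production RDBMS):

```python
import csv
import sqlite3

def extract_to_csv(conn, query, csv_path):
    """Spool a query result to a CSV file (header line + data rows)."""
    cur = conn.execute(query)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([d[0] for d in cur.description])  # header line
        writer.writerows(cur)

def load_from_csv(conn, table, csv_path):
    """Bulk-load a CSV file into an existing target table."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # skip the header line
        placeholders = ",".join("?" * len(header))
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    conn.commit()

# Demo: copy a table between two databases through a CSV file.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE emp (id INTEGER, name TEXT)")
src.executemany("INSERT INTO emp VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE emp (id INTEGER, name TEXT)")

extract_to_csv(src, "SELECT * FROM emp", "emp.csv")
load_from_csv(tgt, "emp", "emp.csv")
print(tgt.execute("SELECT COUNT(*) FROM emp").fetchone()[0])  # 2
```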
## Other scripts
- Oracle -> Redshift data loader
- PostgreSQL -> Redshift data loader
- MySQL -> Redshift data loader
- Oracle -> S3 data loader
- CSV -> Redshift data loader
## Audience
Database developers, ETL developers, data integrators.
## Designated environment
Pre-Prod (UAT/QA/DEV)
## Databases supported
### SQL
Database | GUI (data-buddy) | Command line (DataBuddy CLI) |
---|---|---|
DB2 | yes | yes |
Informix | yes | yes |
MariaDB | yes | yes |
MySQL | yes | yes |
Infobright | yes | yes |
Oracle | yes | yes |
PostgreSQL | yes | yes |
SQLite | yes | yes |
SQLServer | yes | yes |
Sybase | yes | yes |
TimesTen | yes | yes |
### noSQL
Database | GUI (data-buddy) | Command line (DataBuddy CLI) |
---|---|---|
MongoDB | yes | yes |
## Data Tools
Name | GUI (data-buddy) | Command line (DataBuddy CLI) |
---|---|---|
CURL | yes | yes |
## File formats
### SQL
Database | DDL extract | CSV extract | CSV load |
---|---|---|---|
DB2 | | yes | yes |
Informix | | yes | yes |
MariaDB | | yes | yes |
MySQL | | yes | yes |
Oracle | Table | yes | yes |
PostgreSQL | | yes | yes |
SQLite | | yes | yes |
SQLServer | | yes | yes |
Sybase | | yes | yes |
TimesTen | | yes | yes |

DDL extract works only for Oracle tables.
### noSQL
Database | CSV extract | CSV load | JSON extract | JSON load |
---|---|---|---|---|
MongoDB | yes | yes | yes | yes |
### Data Tools
Name | CSV extract | CSV load | JSON extract | JSON load |
---|---|---|---|---|
CURL | yes | yes | | |
## Components
- GUI - data-buddy (wxPython, PyInstaller).
- Command line - DataBuddy CLI (Python, PyInstaller).
## Features
### Front end, GUI (data-buddy)
- Session management.
### Command line (DataBuddy CLI)
- Multi-query load.
- Partition/sub-partition copy
- Sharded copy (turbo mode)
- Custom spool location (config/user_conf.py)
- config/include/oracle.py - configurable SQL*Loader args.
- 3 generic arguments (use them to pass job_id or timestamp and process in config/user_config.py)
- Added all use cases.
- lame_duck/limit fix for trial runs
- keep_data_file param (set it to 1 if you want to keep data dump)
- White-space control.
- Header line control.
- Truncate target table/partition/subpartition
- Ask to truncate.
- No client (url) connect.
- Supports CSV file load from multiple dirs.
- --exit_on_key - lets you keep the exec window open after the load job is done.
- file download using curl.exe
Legend for the tables below: E = Extract, L = Load, C = Copy.
### SQL
Database | Table | Partition | Subpartition | CSV Files | Queries |
---|---|---|---|---|---|
DB2 | E/L/C | | | L/E | E/C |
Informix | E/L/C | | | L/E | E/C |
MariaDB | E/L/C | | | L/E | E/C |
MySQL | E/L/C | | | L/E | E/C |
Oracle | E/L/C | E/L/C | E/L/C | L/E | E/C |
PostgreSQL | E/L/C | E/L/C | E/L/C | L/E | E/C |
SQLite | E/L/C | | | L/E | E/C |
SQLServer | E/L/C | E/L/C | | L/E | E/C |
Sybase | E/L/C | | | L/E | E/C |
TimesTen | E/L/C | | | L/E | E/C |
### noSQL
Database | Collection | CSV Files | JSON Files | Queries |
---|---|---|---|---|
MongoDB | E/L/C | E/L | E/L | E/C |
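Several of the command-line features above are Python config hooks (config/user_conf.py, config/include/oracle.py) fed by the three generic arguments. A hypothetical user_conf.py that routes spool files using those arguments might look like this; the function name and argument shape are illustrative, not DataBuddy's actual API:

```python
# Hypothetical sketch of config/user_conf.py -- real hook names may differ.
import os
import time

def custom_spool_dir(base_dir, generic_args):
    """Route spool files into a per-job directory using the generic arguments.

    generic_args is assumed to arrive as (job_id, timestamp, tag); any of the
    three may be an empty string when not supplied on the command line.
    """
    job_id, ts, tag = generic_args
    ts = ts or time.strftime("%Y%m%d")           # default timestamp if omitted
    parts = [p for p in (job_id, ts, tag) if p]  # drop empty slots
    return os.path.join(base_dir, *parts)

print(custom_spool_dir(r"C:\spool", ("job42", "20240101", "")))
```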
## Tools used to extract, load, and query data
- DbShell - queries the target and source for table metadata.
- Spooler - extracts data from the source to a temp file.
- Loader - loads the temp file to the target using a bulk loader.
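Each of these roles is an external tool launched out of process. That pattern can be sketched as a small runner; in the demo the Python interpreter stands in for the sqlplus.exe/sqlldr.exe-style tools listed below so the skeleton runs anywhere:

```python
import subprocess
import sys

def run_role(cmd):
    """Run one pipeline role (DbShell, Spooler, or Loader) out of process,
    capture its output, and fail fast on a non-zero exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} failed: {result.stderr}")
    return result.stdout

# A real run would invoke the vendor tool, e.g. sqlldr.exe with a control
# file; here the interpreter acts as a harmless stand-in.
out = run_role([sys.executable, "-c", "print('spooled 42 rows')"])
print(out.strip())
```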
#### SQL stores
DB family | Database | Spooler | Loader | DbShell |
---|---|---|---|---|
DB2 | DB2 Advanced Enterprise Server | db2.exe | db2.exe | db2.exe |
| | DB2 Advanced Workgroup Server | db2.exe | db2.exe | db2.exe |
| | DB2 Developer Edition | db2.exe | db2.exe | db2.exe |
| | DB2 Express | db2.exe | db2.exe | db2.exe |
| | DB2 Express C | db2.exe | db2.exe | db2.exe |
| | DB2 Enterprise Server | db2.exe | db2.exe | db2.exe |
| | DB2 Workgroup Server | db2.exe | db2.exe | db2.exe |
MySQL | MySQL | mysql.exe | mysql.exe | mysql.exe |
| | Infobright | mysql.exe | mysql.exe | mysql.exe |
| | MariaDB | mysql.exe | mysql.exe | mysql.exe |
Informix | Informix IDS | dbaccess.exe | dbaccess.exe | dbaccess.exe |
| | Informix Innovator C | dbaccess.exe | dbaccess.exe | dbaccess.exe |
Oracle | Oracle 12c | sqlplus.exe | sqlldr.exe | sqlplus.exe |
| | Oracle 11g | sqlplus.exe | sqlldr.exe | sqlplus.exe |
| | Exadata | sqlplus.exe | sqlldr.exe | sqlplus.exe |
| | Oracle XE | sqlplus.exe | sqlldr.exe | sqlplus.exe |
PostgreSQL | PostgreSQL | psql.exe | psql.exe | psql.exe |
SQL Server | SQL Server Enterprise | sqlcmd.exe | sqlcmd.exe | sqlcmd.exe |
| | SQL Server Express | sqlcmd.exe | sqlcmd.exe | sqlcmd.exe |
SAP Sybase | Sybase SQL Anywhere | dbisql.com | dbisql.com | dbisql.com |
| | SAP Sybase ASE | dbisql.com | dbisql.com | dbisql.com |
| | Sybase IQ | dbisql.com | dbisql.com | dbisql.com |
TimesTen | TimesTen | ttBulkCp.exe | ttBulkCp.exe | ttIsql.exe |
SQLite | SQLite | sqlite3.exe | sqlite3.exe | sqlite3.exe |
#### noSQL stores
Database | Spooler | Loader | DbShell |
---|---|---|---|
MongoDB | mongoexport.exe | mongoimport.exe | mongo.exe |
#### Data Tools
Name | Spooler | Loader | DbShell |
---|---|---|---|
CURL | curl.exe | n/a | n/a |
- It does not create the target table.
- It does not pipe data (it extracts into a file, then loads).
- It should not be used in Prod; trial/ad-hoc use only.
## Implementation
- Written in Python (command line) and wxPython (GUI).
- Compiled with PyInstaller.
## Configuration
- Modify the default host_map.py:
- Set your 'source' and 'target' dirs for local clients for each database:

```python
'host_list': {0: {'db_env': {'ORA11G': {'source': 'C:\\app\\alex_buz\\product\\11.2.0\\dbhome_2\\BIN',
                                        'target': 'C:\\app\\alex_buz\\product\\11.2.0\\dbhome_2\\BIN'},
```

- You are good to go.
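For reference, a fully closed version of that host_map.py fragment might look like this (the paths and the 'ORA11G' key are examples, not shipped defaults):

```python
# Illustrative host_map.py entry: one host slot (0) mapping a database
# environment name to the client 'source' and 'target' binary directories.
HOST_MAP = {
    'host_list': {
        0: {
            'db_env': {
                'ORA11G': {
                    'source': r'C:\app\alex_buz\product\11.2.0\dbhome_2\BIN',
                    'target': r'C:\app\alex_buz\product\11.2.0\dbhome_2\BIN',
                },
            },
        },
    },
}
```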
## Execution
- Run data-buddy.exe from the Windows command line or File Explorer.
# Target object Truncate
- A warning window pops up every time you try to run the DataBuddy CLI with truncation of your target object.
# Templates vs. free argument entry
## Pros
- All templates are tested with presets.
## Cons
- The user has to create a new session if an argument needs to be added or removed.
- There is no way to add or remove arguments to your choosing; argument combos come as templates which you select in the "Create new session" window.
# TODO
- argument values reuse from existing session. DONE
- clean uargs.db. DONE
- nls_format* duplication. DONE
- test UI.
- add --log_dir to backend. DONE
- add "source" and "target" datasources to "New Session". DONE
- Copy/Paste of argument values between sessions.
- generic "New Session" so user not limited by source and target templates. DONE
- history of values for each argument.
- cleanup all other databases but Oracle.
- more templates and better templates hierarchy.
- init templates to open session for "New Session". DONE
- validate args on Run. DONE.
- smaller test_api files (get default args from test routines, not canned files).
- fix template filters. DONE
- create "Menu" button and hide "About".
- create templates tab DONE
- create sessions tab DONE
- validate session name for chars not usable in file name. DONE
- allow user to create multiple session libraries. DONE
- fix malformed path from MDD.MultiDirDialog. DONE
- make sure all paths are Windows friendly. DONE
- open dir and open file use stale values. DONE
- filter control keys from ones affecting field value. DONE
- close all existing shells/shell groups upon exit.
- highlight running sessions. DONE
- beep on failing sessions.
- free argument entry. DONE
- let user change arg list for a given group. DONE
- detect DONE/FAILED from cmd window. DONE
- let user disable post-etl email (Common: email_to). DONE
- change copy_vector format from db2db to db-db DONE
- fix flicker on frame freeze DONE
- save on close DONE
- Table DDL export for Oracle DONE
- validate all path arguments before run DONE
- add create template menu DONE
- add "--host_map" arg to set hosts for each thread DONE
- let user change host mapping DONE
- save as template DONE
- save as DONE
- create main menu DONE
- let user configure default argument values
- let user change frame size DONE
- test ezconnect
- add "Output" tab DONE
- add '--compress_spool' arg for zipped output
- add sqlloader.py DONE
- add curl.exe for file download DONE
- load/save app state.
- remove wx.BitmapButton memory leak
- fix session sort DONE
- fix session refresh DONE
- fix awg.flatmenu memory leak DONE
- fix awg.scrollablepanel memory leak
- let user set primary dbs, source/target dbs
- add scp/ftp/sftp as data source/target
- make all paths relative to transport_home DONE
- create wizard for "New Session" template selection
- distinguish A-Templates from B-Templates
- let data copy be executed on Linux (bash via ssh) DONE
- let data spool/load to be executed on Linux (bash via ssh)
- use netcat for job status instead of keeping open ssh conn
- use named file for QC status report instead of CLI title.
- add Oracle 12c Release 1 DONE
- add Oracle 11g Release 2 DONE
- add SQLServer DONE
- add MySQL DONE
- add PostgreSQL DONE
- add SQLite DONE
- add Informix DONE
- add SAP ASE Sybase DONE
- add DB2 DONE
- add dBase
- add MS Access
- add Pandas
- add [Apache Spark](https://spark.apache.org/docs/latest/)
- add [Apache Storm](https://storm.apache.org/documentation/Home.html)
- add Hadoop
- add ZooKeeper
- add HDFS
- add MariaDB DONE
- add Infobright DONE
- add TimesTen DONE
- add [MongoDB](http://docs.mongodb.org/manual/) DONE
- add [HBase](http://hbase.apache.org/)
- add [Cassandra](http://cassandra.apache.org/)
- add [Bigtable](https://cloud.google.com/bigtable/docs/)
- add [Teradata](http://www.info.teradata.com/HTMLPubs/DB_TTU_14_00/index.html#page/Storage_Management/B035_2492_071A/2492ch01.084.17.html)
- add [Apache Hive](http://doc.mapr.com/display/MapR/Hive)
- add Vertica
- add Netezza
- add Parstream
- add ParAccel
- add Redis
- add ZODB
- add MonetDB
- add Vectorwise
- add MarkLogic
- add Amazon SimpleDB
- add Twitter Streaming API (extract)
- add Google Analytics
- add Crate Data
- add Facebook (extract)
- add Taleo HR (extract)
- add Salesforce (load, extract)
- add Greenplum
- add Actian Ingres
- add McObject
- add NuoDB
- add Disco
- add Clustrix
- add OrientDB
- add KDB+
- add Volt DB
- add Elasticsearch
- add Azure DocumentDB
- add Arango DB
- add Foundation DB
- add Enterprise DB
- add Altibase HDB
- add EXASOL
- add Aster
- add Kinesis (extract)
- add Amazon Redshift
- add Amazon Aurora (RDS)
- add RethinkDB
- add Amazon DynamoDB
- add Couchbase
- add Aerospike
- add Riak
- add MemSQL
- add InterSystems Caché, export
- add QV (QlikView file format)
- add MDX (file format)
- add SQream DB
- add [1010data](https://www.1010data.com/technology)
- add Kognitio
- add SAND Technology
- add InfiniDB
- add SQL Data Warehouse
- Create tests using PyAutoGUI
- Add Amazon Simple Storage Service [S3](https://aws.amazon.com/s3/)
- Add [Apache Derby](https://db.apache.org/derby/)
## Limitations
- Tested to run only on Windows for now (even though it's wxPython).
- CSV dump files are uncompressed (zip compression will be added as an option).
- Physical copy is done on Windows; only Oracle copy can be executed on Linux (bash via ssh).
## Performance
- Data copy speed mostly depends on your NIC (Ethernet) speed and other factors, such as how far you are from the target and source servers (in terms of network topology and physical distance). I've seen a 10x performance improvement when running on a DEV Linux server (10Gb Ethernet) vs. my office Windows desktop (100Mb Ethernet).
## References
- DataBuddy CLI for Oracle -- `qc32\qc32.exe`
## Vendor Docs
[MongoDB Architecture Guide](http://s3.amazonaws.com/info-mongodb-com/MongoDB_Architecture_Guide.pdf)

Data Tools | Version |
---|---|
[cURL](http://curl.haxx.se/download.html) | 7.33 |
- New Session Menu:
- Add New Session:
- Run Session:
- Executed Session:
- Output Tab:
## Copy vector matrix (via CSV)
Database | DB2 | Informix | MariaDB | MySQL | Oracle | PostgreSQL | SQLite | SQLServer | Sybase | TimesTen | MongoDB |
---|---|---|---|---|---|---|---|---|---|---|---|
DB2 | x | x | x | x | x | x | x | x | x | x | x |
Informix | x | x | x | x | x | x | x | x | x | x | x |
MariaDB | x | x | x | x | x | x | x | x | x | x | x |
MySQL | x | x | x | x | x | x | x | x | x | x | x |
Oracle | x | x | x | x | x | x | x | x | x | x | x |
PostgreSQL | x | x | x | x | x | x | x | x | x | x | x |
SQLite | x | x | x | x | x | x | x | x | x | x | x |
SQLServer | x | x | x | x | x | x | x | x | x | x | x |
Sybase | x | x | x | x | x | x | x | x | x | x | x |
TimesTen | x | x | x | x | x | x | x | x | x | x | x |
MongoDB | x | x | x | x | x | x | x | x | x | x | x |

'x' means the feature is implemented and is part of release 0.3.3.
## Copy vector matrix (via JSON)
Database | MongoDB |
---|---|
DB2 | x |
Informix | x |
MariaDB | x |
MySQL | x |
Oracle | x |
PostgreSQL | x |
SQLite | x |
SQLServer | x |
Sybase | x |
TimesTen | x |
MongoDB | x |
## Databuddy Docs
- General docs
- SQL Server
- MongoDB
- Oracle
- CURL
## Download

```shell
git clone https://github.com/data-buddy/DataBuddy/
```

- DataBuddy latest release -- data-buddy 0.3.5
# FAQ
Yes, it is the main purpose of this tool.
Yes, assuming they are doing it on Windows.
It is as fast as SQL*Loader, because I'm using SQL*Loader in DIRECT mode.
#### How to increase upload speed?
Parallelize your load by loading multiple partitions in parallel or sharding your data.
You can use SQL*Loader, or execute insert statements via ODBC/JDBC or SQL*Plus.
No, Redshift will not recognize the *.zip file format. You have to gzip it; you can use 7-Zip to do that.
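If you'd rather script the compression than use 7-Zip, Python's standard library can produce the gzip file:

```python
import gzip
import shutil
from pathlib import Path

# Create a sample CSV dump so the snippet is self-contained.
Path("data.csv").write_bytes(b"id,name\n1,Ann\n")

# Compress data.csv to data.csv.gz, a format Redshift's COPY accepts.
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```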
No
No
You can set the Common Args `debug_level` to a value > 1.
- The CSV file you provided is analyzed and the number of records in it is estimated.
- A control file is created using the target table's column names.
- Data is loaded using SQL*Loader.
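The control-file step above can be sketched as a small generator. The template is a minimal SQL*Loader control file; the field options shown are illustrative, not the exact ones DataBuddy emits:

```python
def make_control_file(table, columns, csv_path):
    """Build a minimal SQL*Loader control file from target column names.

    Mirrors the steps above: the column list comes from the target table,
    and the data file is the analyzed CSV.
    """
    cols = ",\n  ".join(columns)
    return (
        "LOAD DATA\n"
        f"INFILE '{csv_path}'\n"
        f"INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        f"(\n  {cols}\n)\n"
    )

ctl = make_control_file("EMP", ["EMPNO", "ENAME"], "emp.csv")
print(ctl)
```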
Use `--skip_rows 1` to ignore input rows.
Use `--nls_date_format "MM/DD/YYYY HH12:MI:SS"`, `--nls_timestamp_format "MM/DD/YYYY HH12:MI:SS.FF2 PM"`, or `--nls_timestamp_tz_format "MM/DD/YYYY HH12:MI:SS.FF2 TZH:TZM"` to control the timestamp format.
I used SQL*Loader and wxPython to write it.
Please contact me for sources.
Yes, please ask me for new features.
- [Oracle_To_S3_Data_Uploader](https://github.com/alexbuz/Oracle_To_S3_Data_Uploader) - streams Oracle data to Amazon S3.
- [S3_Sanity_Check](https://github.com/alexbuz/S3_Sanity_Check/blob/master/README.md) - lets you ping an Amazon S3 bucket to see if it's publicly readable.
- EC2_Metrics_Plotter - plots any CloudWatch EC2 instance metric stats.
- S3_File_Uploader - uploads files from Windows to S3.
Yes, AWS Certified Developer (Associate)
Yes, you can PM me here or email me at alex_buz@yahoo.com. I'll get back to you within hours.
### Links