
Set of scripts to stream the binary logs and replicate across datastores.


CD-Stream

V1.0

CD-Stream is a cross-database replication tool driven by change data capture (CDC); it currently supports replicating from MySQL to Postgres.

The Reason Why:

  • Timed data extraction (straightforward ETL) using SELECTs on a production database can be costly and resource-intensive.
  • Cron jobs would have to be scheduled for it, and those jobs can fail too.

What's New?

In the current version, support is provided for replicating from MySQL and loading the data into Postgres. The loading jobs are queued in Redis and processed automatically, thanks to rq workers.
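Conceptually, each binlog row event is translated into a parameterized statement for the target database before being enqueued. A minimal sketch of that translation step (the function and table names here are illustrative, not part of the codebase):

```python
def row_to_insert(table, row):
    """Render a parsed binlog row event as a parameterized INSERT.

    `row` maps column names to values, as produced by binlog parsers
    such as python-mysql-replication.
    """
    cols = sorted(row)  # deterministic column order
    placeholders = ", ".join("%s" for _ in cols)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(cols), placeholders
    )
    return sql, tuple(row[c] for c in cols)

sql, params = row_to_insert("users", {"id": 1, "name": "ada"})
# sql    -> INSERT INTO users (id, name) VALUES (%s, %s)
# params -> (1, 'ada')
```

An rq worker would then execute such a statement against the Postgres target.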

Prerequisite:

Check that binary logging is enabled in your source database. Run the following query against your source database to verify:

Mysql:

select variable_value as "BINARY LOGGING STATUS (log-bin) :: " from information_schema.global_variables where variable_name='log_bin';

(On MySQL 5.7 and later, this table moved to performance_schema.global_variables; alternatively, show variables like 'log_bin'; works on all versions.)

If the above command returns "OFF", add the following lines under the [mysqld] section of a configuration file in /etc/mysql/mysql.conf.d/ and restart the mysql service:

log_bin                 = mysql-bin
binlog_format           = ROW
expire_logs_days        = 10
max_binlog_size         = 100M

(binlog_format = ROW makes the binlog carry row-level changes, which CDC readers need.)

All set... time to wrangle!

Safety first - put your hard hats on!

  1. Clone the project and initialize a virtual environment:
$ git clone https://github.com/datawrangl3r/cd-stream.git
$ cd cd-stream
$ python3 -m venv .
$ source bin/activate
$ pip install -r requirements.txt
  2. Configure the streamsql.yml file - tailor it to your needs:
EXTRACTION:
    ENGINE: mysql
    HOST: localhost
    PORT: 3306
    USER: root
    PASS: password
    DB: SOURCEDB
COMMIT:
    ENGINE: postgres
    HOST: localhost
    PORT: 5432
    USER: postgres
    PASS: password
    DB: TARGETDB
QUEUE:
    ENGINE: REDIS
    HOST: localhost
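For reference, the file parses into plain nested dictionaries. A sketch of how a loader might read and sanity-check it (the CONFIG string mirrors the example above; the checks are illustrative, not the tool's actual validation):

```python
import yaml  # PyYAML

# Same shape as streamsql.yml above (values abbreviated)
CONFIG = """
EXTRACTION:
    ENGINE: mysql
    HOST: localhost
    PORT: 3306
COMMIT:
    ENGINE: postgres
    HOST: localhost
    PORT: 5432
QUEUE:
    ENGINE: REDIS
    HOST: localhost
"""

cfg = yaml.safe_load(CONFIG)

# Fail fast if a section is missing, before any replication starts
for section in ("EXTRACTION", "COMMIT", "QUEUE"):
    if section not in cfg:
        raise SystemExit("streamsql.yml is missing the %s section" % section)

source, target = cfg["EXTRACTION"], cfg["COMMIT"]
# source["ENGINE"] -> mysql, target["PORT"] -> 5432
```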
  3. Make sure a Redis server is running, then start rq workers in the background:
$ rq worker &
  4. Start replication and data load (use Supervisor if needed):
$ python main.py &
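If you do run it under Supervisor, minimal program entries might look like the following (the /opt/cd-stream paths are placeholders for wherever you cloned the project and created the virtualenv):

```ini
[program:cd-stream]
command=/opt/cd-stream/bin/python /opt/cd-stream/main.py
directory=/opt/cd-stream
autorestart=true

[program:rq-worker]
command=/opt/cd-stream/bin/rq worker
directory=/opt/cd-stream
autorestart=true
```

Supervisor then restarts either process if it dies, replacing the bare `&` background jobs above.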