/cubetl

CubETL - Framework and tool for data ETL (Extract, Transform and Load) in Python

Primary LanguagePythonMIT LicenseMIT

CubETL

CubETL is a tool for data ETL (Extract, Transform and Load).

CubETL provides a mechanism to run data items through a processing pipeline. It takes care of initializing only the components used by the process, manages the data flow across the process graph, logging and cleanup. It can be used as a tool or from Python code.

Features:

  • Consumes and produces CSV, XML, JSON...
  • SQL support (querying, inserting/updating, schema creation, schema loading)
  • OLAP support:
    • Star-schema generation and data loading
    • SQL-to-OLAP schema generator
    • Cubes OLAP Server model export
    • SDMX and PC-Axis read support
  • Text templating, GeoIP, HTTP requests...
  • Insert / upsert for memory tables, SQL tables and OLAP entities
  • Extensible
  • Caching

It provides out-of-the-box components that handle common data formats and it also includes SQL and OLAP modules that understand SQL and OLAP schemas and map data across them. This allows CubETL to insert OLAP facts across multiple tables in a single store operation, while automatically performing (and caching) the appropriate lookups.

CubETL can also inspect an existing relational database and generate an OLAP schema, and the other way around. It can also produce configuration for Databrewery Cubes Server. All together it allows for a quick analytical inspection of an arbitrary database (see the examples below).

Note: This project is in development stage and is tested in few environments. Documentation is incomplete. You will hit issues. Please use the issue tracker for bugs, questions, suggestions and contributions.

Download / Install

While CubETL is in development no pip packages are provided, and it should be installed using python setup.py develop. Using a virtualenv or docker instance is recommended.

git clone https://github.com/jjmontesl/cubetl.git
cd cubetl

# Using a virtualenv is usually recommended:
python3 -m venv env
. env/bin/activate

# Install dependencies (this command is for Ubuntu)
sudo apt-get install python3-dev libxml2-dev libxslt1-dev zlib1g-dev

# Install CubETL (in development mode, so you can make changes to the source)
python setup.py develop

# Test
cubetl -h

Install on docker

First, install Docker. Then:

docker-compose build

Test the executable (quick):

docker-compose run cubetls /usr/local/bin/cubetl -h

Run pytest tests (this is currently quite slow: 20-30 minutes):

docker-compose run cubetls pytest -v

Run one pytest:

docker-compose run cubetls python -m pytest -v -k test_config_new

Usage

Cubetl provides a command line tool, cubetl:

cubetl [-dd] [-q] [-h] [-p property=value] [-m attribute=value] [config.py ...] <start-node>

    -p   set a context property
    -m   set an attribute for the start item
    -d   debug mode (can be used twice for extra debug)
    -q   quiet mode (bypass print nodes)
    -l   list config nodes ('cubetl.config.list' as start-node)
    -h   show this help and exit
    -v   print version and exit

Each CubETL configuration can contain one or more process nodes. You must specify the list of configuration files (.py files), followed by the process nodes you want to run.

Note that, when running a complete ETL process, you need to remove print nodes or to use the -q command line option to remove prints to standard output, which will otherwise heavily slowdown the process.

You can also use CubETL directly from Python code.

Examples

Visualizing a SQL database

CubETL can inspect a SQL database and generate a CubETL OLAP schema and SQL mappings for it. Such schema can then be visualized using CubesViewer:

# Inspect database and generate a cubes model and config
cubetl cubetl.sql.db2sql cubetl.olap.sql2olap cubetl.cubes.olap2cubes \
    -p db2sql.db_url=sqlite:///mydb.sqlite3 \
    -p olap2cubes.cubes_model=mydb.cubes-model.json \
    -p olap2cubes.cubes_config=mydb.cubes-config.ini

# Run cubes server (in background with &)
pip install https://github.com/DataBrewery/cubes/archive/master.zip click flask --upgrade
slicer serve mydb.cubes-config.ini &

# Run a local cubesviewer HTTP server (also opens a browser)
# NOTE: not yet available, please download and use CubesViewer manually!
pip install cubesviewer-utils
cvutils cv

This will open a browser pointing to a local CubesViewer instance pointing to the previously launched Cubes server. Alternatively, you can download CubesViewer and load the HTML application locally.

The CubETL project contains an example database that you can use to test this (see the Generate OLAP schema from SQL database and visualize example below).

You can control the schema generation process using with options. Check the documentation below for further information.

Creating a new ETL process config

Create a new directory for your ETL process and inside it run:

cubetl cubetl.config.new -p config.name=myprocess

This will create a new file myprocess.py, which you can use as a template for your new ETL process.

The created example config includes an entry node called myprocess.process, which simply prints a message to console. You can test your ETL process using:

cubetl myprocess.py myprocess.process

See the example ETL below fore more examples, the documentation section for information about how to define ETL processes.

Example ETL processes

Example ETL processes included with the project:

To run these examples you'll need the examples directory of the cubetl project, which is not included in the PyPI pip download. You can get them by cloning the cubetl project repository (git clone https://github.com/jjmontesl/cubetl.git) or by downloading the packaged project.

Running from Python

In order to configure and/or run a process from client code, use:

import cubetl

# Create Cubetl context
ctx = cubetl.cubetl()

# Add components or include a configuration file...
ctx.add('your_app.node_name', ...)

# Launch process
result = ctx.run("your_app.node_name")

See the examples/python directory for a full working example.

Documentation

Guide:

  • Sequential formats (text files, CSV, JSON...)
  • Tables
  • Process flow (unions, chain forks, repeats...)
  • SQL
  • OLAP
  • Script nodes, custom functions, custom nodes and custom components

Config Library:

Doc:

Examples:

  • See "Example ETL processes" above

Support

If you have questions, problems or suggestions, please use:

If you are using or trying CubETL, please tweet #cubetl.

Source

Github source repository: https://github.com/jjmontesl/cubetl

Authors

CubETL is written and maintained by Jose Juan Montes.

See AUTHORS file for more information.

License

CubETL is published under MIT license. For full license see the LICENSE file.

Other sources: