DataGraft Platform

This repository contains the startup and configuration scripts needed to configure, deploy, and run the DataGraft Platform as a multi-container Docker application using Docker Compose. Each release of each DataGraft Platform microservice is published as a pre-built Docker image on Docker Hub (https://hub.docker.com/u/datagraft/).

Architecture

The DataGraft Platform consists of two main parts:

DataGraft
Grafterizer

Both parts are implemented as a microservice architecture made up of a large number of sub-components, each running in its own Docker container. The DataGraft and Grafterizer user interfaces have been integrated to provide a consistent user experience, and their underlying microservices communicate with each other through REST. The individual components are illustrated in the figure below.

DataGraft consists of the following components:

DataGraft Portal: The portal serves several functions. First, it provides the web-based front-end used by data publishers. Internally, it implements the data model and provides the object-relational mapping between that model and the database back-end. It also handles communication with the database and manages the storage of uploaded files (a Docker volume, or Amazon S3 in production). Finally, it implements the connection to the data hosting and access services.
DataGraft DBMS: The database management system (PostgreSQL) for the user data and the asset catalogue. Data are stored in a separate volume (a Docker volume, or Amazon RDS in production).

Grafterizer has the following sub-components:

Grafterizer: Front-end component that implements the interactive GUI for data cleaning and data transformations.
Grafterizer dispatch service: A server component for the Grafterizer front-end that handles request authentication on its behalf (in order to ensure security) and dispatches requests for input and output across the multiple services.
Graftwerk: A sandboxed server component that executes the data cleaning and transformation scripts generated by the Grafterizer front-end on the input data sent by the dispatcher. Graftwerk uses a proprietary load-balancing component to distribute traffic when many users are using the transformation tool at the same time.
Graftwerk cache: A FIFO cache service for the Grafterizer front-end requests to Graftwerk.
Vocabulary manager: Simple RDF vocabulary management service for imported vocabularies used in the RDF mapping in the front-end. Enables searching through concepts and importing.
Jarfter: A web service component for compiling executable JARs for transformations generated by the Grafterizer front-end.

Installation & Deployment

This repository contains the following files:

docker-compose.yml - Docker Compose definition file to run the DataGraft Platform as a multi-container Docker application
startup.sh - startup script that should be executed the first time you run the DataGraft Platform in order to create an admin user account

Docker Compose configuration

To deploy and run the DataGraft Platform:

docker-compose pull
docker-compose up
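
If you prefer to run the platform in the background, the two commands above can be wrapped as in the sketch below. The `-d` (detached) flag and the `ps` subcommand are standard Docker Compose features; the `compose` and `start_platform` function names and the `DRY_RUN` variable are hypothetical helpers introduced only for this illustration.

```shell
# Run docker-compose, or just print the command when DRY_RUN is set.
# (DRY_RUN is a hypothetical switch used only in this sketch.)
compose() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "docker-compose $*"
    else
        docker-compose "$@"
    fi
}

# Pull the published images, start all services detached, list them.
start_platform() {
    compose pull && compose up -d && compose ps
}
```

Calling `start_platform` with `DRY_RUN=1` set prints the three commands without touching Docker, which is a cheap way to check what will run.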

The default docker-compose.yml file is configured to deploy and run the DataGraft Platform on localhost using Postgres. Edit the environment variables for each service to change the deployment settings:

database

POSTGRES_PASSWORD - password of the admin user (defined in startup.sh)
POSTGRES_DB - name of database (defined in startup.sh)

datagraft-portal

DATABASE_URL - URL of the Postgres database
DATABASE_HOST - host of the Postgres database
DATABASE_PASSWORD - password of the admin user (defined in startup.sh)
RAILS_ENV - Rails environment [development | staging | test]
SECRET_KEY_BASE - secret key base
GRAFTERIZER_PUBLIC_PATH - URL of Grafterizer service
GRAFTWERK_URI - URI of Graftwerk service
DATAGRAFT_DEPLOY_HOST - host of the DataGraft Portal service
DATAGRAFT_DEPLOY_PORT - port of the DataGraft Portal service
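
As an illustration of how DATABASE_URL relates to the other database settings, the sketch below assembles a Postgres connection URL from its parts. The `postgres://user:password@host:port/database` shape is the common URL form accepted by Rails; the `build_database_url` helper, the `postgres` user name, and port 5432 are assumptions made for this example, so check them against your own docker-compose.yml.

```shell
# Assemble a Postgres connection URL.
# Arguments: user password host port database
# Note: a password containing characters such as '@' or '/' would need
# percent-encoding first; this sketch does not do that.
build_database_url() {
    printf 'postgres://%s:%s@%s:%s/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with assumed values ('database-host' is the link alias of the
# database service, 'datagraft-dev' the default database name).
build_database_url postgres password database-host 5432 datagraft-dev
```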

grafterizer-dispatch-service

COOKIE_STORE_SECRET - cookie store secret
OAUTH2_CLIENT_ID - Grafterizer UID (retrieved when configuring/starting Grafterizer in DataGraft)
OAUTH2_CLIENT_SECRET - Grafterizer secret key (retrieved when configuring/starting Grafterizer in DataGraft)
GRAFTWERK_URI - URI of the Graftwerk service
GRAFWERK_CACHE_URI - URI of the Graftwerk cache service
DATAGRAFT_URI - Public URI of the DataGraft Portal service
CORS_ORIGIN - Public URI of the backend server
PUBLIC_CALLBACK_SERVER - Same as DATAGRAFT_URI by default
PUBLIC_OAUTH2_SITE - URI of the OAuth2 server
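
Values such as SECRET_KEY_BASE and COOKIE_STORE_SECRET should be long random strings rather than the defaults. One possible way to generate them is sketched below; the `gen_secret` helper is hypothetical, and the 64-byte default (128 hex characters) is chosen to match the length Rails generates for its own secrets, not a requirement stated by this repository.

```shell
# Print N random bytes (default 64) from /dev/urandom as lowercase hex.
gen_secret() {
    head -c "${1:-64}" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
}

# Example: print one value for each setting.
echo "SECRET_KEY_BASE=$(gen_secret)"
echo "COOKIE_STORE_SECRET=$(gen_secret)"
```

Paste the generated values into the environment section of the corresponding service in docker-compose.yml.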

Startup script

The first time you run the DataGraft Platform you will need to create an admin user account:

startup.sh

Edit the startup.sh script if you want to change the default login name 'administrator@datagraft.net' and the default password 'password' for the admin user account. Make sure that the database name in the script ('datagraft-dev' by default) matches the database environment settings in the docker-compose.yml file.
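
To check that the database name matches, you can read POSTGRES_DB back out of the Compose file and compare it with the name used in startup.sh. The sketch below is one way to do that; the `compose_db_name` helper is hypothetical and assumes the variable is written either as `POSTGRES_DB: name` or as `- POSTGRES_DB=name`.

```shell
# Print the first POSTGRES_DB value found in the given Compose file.
# Handles "POSTGRES_DB: name" and "- POSTGRES_DB=name", with or
# without quotes around the value.
compose_db_name() {
    sed -n 's/^[[:space:]-]*POSTGRES_DB[[:space:]]*[:=][[:space:]]*//p' "$1" \
        | tr -d "\"'" | head -n 1
}

# Compare against the default name, if a Compose file is present.
if [ -f docker-compose.yml ]; then
    if [ "$(compose_db_name docker-compose.yml)" != "datagraft-dev" ]; then
        echo "POSTGRES_DB does not match 'datagraft-dev'" >&2
    fi
fi
```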

Amazon S3 and RDS configuration

The default docker-compose.yml file is configured to deploy and run the DataGraft Platform on localhost using Postgres. To configure it for cloud deployment using Amazon S3 and Amazon RDS, you will need to:

Remove the database service from the docker-compose.yml file.
Change the environment variables for the datagraft-portal service.

Instead of running the startup.sh script to add a Postgres administrator user, you need to set up an AWS Identity and Access Management (IAM) user for the DataGraft Platform.

database

Remove the database entry under services.

datagraft-portal

Remove the dependency on the Postgres service under links:

- database:database-host

Remove execution of the startup script under commands:

bash startup.sh

Remove the following environment variables, which are used for Postgres:

DATABASE_URL - URL of the Postgres database
DATABASE_HOST - host of the Postgres database
DATABASE_PASSWORD - password of the admin user (defined in startup.sh)

Add the following environment variables, which are used for Amazon S3 and RDS:

AWS_RDS_DB_NAME - name of the database
AWS_RDS_DB_USERNAME - name of the database user
AWS_RDS_DB_PASSWORD - password for the database
AWS_RDS_DB_HOST - RDS host running the database
AWS_S3_BUCKET_NAME - S3 bucket name
AWS_S3_ACCESS_KEY_ID - S3 access key ID
AWS_S3_ACCESS_KEY_SECRET - S3 access key secret
AWS_S3_REGION - S3 region
SES_SMTP_USERNAME - SMTP username
SES_SMTP_PASSWORD - SMTP password
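
Because a missing setting typically only shows up once the portal is running, it can help to check the list above before starting the containers. The sketch below assumes the values are set as shell variables and passed into docker-compose.yml from there, which is an assumption about your setup rather than something this repository prescribes; `missing_vars` and `check_aws_env` are hypothetical helper names.

```shell
# Settings listed above for the datagraft-portal service.
REQUIRED_AWS_VARS="AWS_RDS_DB_NAME AWS_RDS_DB_USERNAME AWS_RDS_DB_PASSWORD \
AWS_RDS_DB_HOST AWS_S3_BUCKET_NAME AWS_S3_ACCESS_KEY_ID \
AWS_S3_ACCESS_KEY_SECRET AWS_S3_REGION SES_SMTP_USERNAME SES_SMTP_PASSWORD"

# Print the name of every listed variable that is unset or empty.
missing_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Return non-zero and report on stderr if any required setting is missing.
check_aws_env() {
    missing=$(missing_vars $REQUIRED_AWS_VARS)
    if [ -n "$missing" ]; then
        echo "Missing settings:" >&2
        echo "$missing" >&2
        return 1
    fi
}
```

A typical use would be `check_aws_env && docker-compose up`, so that the platform only starts when every setting has a value.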

Questions or issues?

To report bugs, ask questions, or start discussions, please use the GitHub Issues feature.
