DataGraft Platform

This repository contains startup and configuration scripts to configure, deploy and run the DataGraft Platform as a multi-container Docker application using Docker Compose. Each release of a DataGraft Platform microservice is published as a pre-built Docker image on Docker Hub (https://hub.docker.com/u/datagraft/).

Architecture

The DataGraft Platform consists of two main parts:

  1. DataGraft
  2. Grafterizer

Both parts are implemented as a microservice architecture made up of many sub-components, each running in its own Docker container. The DataGraft and Grafterizer user interfaces are integrated to provide a consistent user experience, and their microservices communicate with each other through REST. The individual components are illustrated in the figure below.

[Figure: datagraft-platform architecture diagram]

DataGraft consists of the following components:

  • DataGraft Portal: The portal serves several functions. First, it provides the web-based front-end used by data publishers. Internally, it implements the data model and provides object-relational mapping between it and the database back-end. It also handles communication with the database and manages the storage of uploaded files (Docker volume, or Amazon S3 in production). Finally, this component implements the connection to the data hosting and access services.
  • DataGraft DBMS: This component is the database management system (PostgreSQL) for the user data and asset catalogue. Data are stored in a separate volume (Docker volume, or Amazon RDS in production).

Grafterizer has the following sub-components:

  • Grafterizer: Front-end component that implements the interactive GUI for data cleaning and data transformations.
  • Grafterizer dispatch service: A server component for the Grafterizer front-end that handles request authentication on its behalf (in order to ensure security) and dispatches requests for input and output across the multiple services.
  • Graftwerk: A sandboxed server component that executes the data cleaning and transformation scripts generated by the Grafterizer front-end on the input data sent by the dispatcher. Graftwerk uses a proprietary load-balancing component to distribute traffic when many users run transformations at the same time.
  • Graftwerk cache: A FIFO cache service for the Grafterizer front-end requests to Graftwerk.
  • Vocabulary manager: Simple RDF vocabulary management service for imported vocabularies used in the RDF mapping in the front-end. Enables searching through concepts and importing.
  • Jarfter: A web service component for compiling executable JARs for transformations generated by the Grafterizer front end.

Installation & Deployment

This repository contains the following files:

  • docker-compose.yml - Docker Compose definition file to run the DataGraft Platform as a multi-container Docker application
  • startup.sh - startup script that should be executed the first time you run the DataGraft Platform in order to create an admin user account

Docker compose configuration

To deploy and run the DataGraft Platform:

docker-compose pull
docker-compose up
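
As a minimal sketch of the two commands above, the function below also starts the services in the background and lists the resulting containers. The function name `deploy_datagraft` and the `DOCKER_COMPOSE` override variable are illustrative assumptions, not part of this repository; `up -d` and `ps` are standard Docker Compose subcommands.

```shell
#!/usr/bin/env bash
# Sketch only: deploy_datagraft and DOCKER_COMPOSE are hypothetical names,
# not something shipped in this repository.
# Run it from the directory that contains docker-compose.yml.
deploy_datagraft() {
  # Allow the Compose binary to be overridden; defaults to docker-compose.
  local dc="${DOCKER_COMPOSE:-docker-compose}"

  "$dc" pull  || return 1   # fetch the pre-built images from Docker Hub
  "$dc" up -d || return 1   # start all services detached (in the background)
  "$dc" ps                  # list the containers and their state
}
```

Once the function is defined, run `deploy_datagraft`. Afterwards, `docker-compose logs -f datagraft-portal` follows the portal's output, which is otherwise hidden in detached mode.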

The default docker-compose.yml file is configured to deploy and run the DataGraft Platform on localhost using Postgres. Edit the environment variables for each service to change the deployment settings:

database

  • POSTGRES_PASSWORD - password of the admin user (defined in startup.sh)
  • POSTGRES_DB - name of database (defined in startup.sh)

datagraft-portal

  • DATABASE_URL - URL of the Postgres database
  • DATABASE_HOST - host of the Postgres database
  • DATABASE_PASSWORD - password of the admin user (defined in startup.sh)
  • RAILS_ENV - Rails environment [development | staging | test]
  • SECRET_KEY_BASE - secret key base
  • GRAFTERIZER_PUBLIC_PATH - URL of Grafterizer service
  • GRAFTWERK_URI - URI of Graftwerk service
  • DATAGRAFT_DEPLOY_HOST - host of the DataGraft Portal service
  • DATAGRAFT_DEPLOY_PORT - port of the DataGraft Portal service

grafterizer-dispatch-service

  • COOKIE_STORE_SECRET - cookie store secret
  • OAUTH2_CLIENT_ID - Grafterizer UID (retrieved when configuring/starting Grafterizer in DataGraft)
  • OAUTH2_CLIENT_SECRET - Grafterizer secret key (retrieved when configuring/starting Grafterizer in DataGraft)
  • GRAFTWERK_URI - URI of the Graftwerk service
  • GRAFWERK_CACHE_URI - URI of the Graftwerk cache service
  • DATAGRAFT_URI - Public URI of the DataGraft Portal service
  • CORS_ORIGIN - Public URI of the backend server
  • PUBLIC_CALLBACK_SERVER - Same as DATAGRAFT_URI by default
  • PUBLIC_OAUTH2_SITE - URI of OAUTH2 server
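
If you prefer to keep secrets out of docker-compose.yml, one option is to export them in your shell and reference them from the file through Compose variable substitution (`${NAME}`). That setup is an assumption here; the default file may set the values directly. Under that assumption, a small pre-flight check such as the hypothetical `missing_vars` helper below could report which variables are still unset before you start the services.

```shell
#!/usr/bin/env bash
# Sketch only: missing_vars is a hypothetical helper, not part of this repository.
# Prints the name of every listed variable that is unset or empty in the
# exported environment, and returns non-zero if any are missing.
missing_vars() {
  local name status=0
  for name in "$@"; do
    # printenv sees only exported variables, which are also the ones
    # Docker Compose can substitute into docker-compose.yml.
    if [ -z "$(printenv "$name")" ]; then
      printf '%s\n' "$name"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_vars COOKIE_STORE_SECRET OAUTH2_CLIENT_ID OAUTH2_CLIENT_SECRET` prints the names that still need a value and exits non-zero if there are any.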

Startup script

The first time you run the DataGraft Platform you will need to create an admin user account:

startup.sh

Edit the startup.sh script if you want to change the default login name 'administrator@datagraft.net' and the default password 'password' for the admin user account. Make sure that the database setting 'datagraft-dev' (default) matches the environment settings in the docker-compose.yml file.
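
A rough way to check that the database name is consistent is to search both files for it. The helper below, `name_in_all_files`, is a hypothetical sketch and not part of this repository. It only confirms that the string appears somewhere in each file, not that it is used as the database setting, so treat a pass as a hint rather than proof.

```shell
#!/usr/bin/env bash
# Sketch only: name_in_all_files is a hypothetical helper, not part of this repository.
# Succeeds only if every listed file contains the given fixed string.
name_in_all_files() {
  local name="$1" file
  shift
  for file in "$@"; do
    # -q: quiet, -F: treat the name as a literal string, not a regex
    if ! grep -qF -- "$name" "$file"; then
      echo "'$name' not found in $file" >&2
      return 1
    fi
  done
}
```

With the default settings, the call would be `name_in_all_files datagraft-dev startup.sh docker-compose.yml`.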

Amazon S3 and RDS configuration

The default docker-compose.yml file is configured to deploy and run the DataGraft Platform on localhost using Postgres. To configure it for cloud deployment on Amazon S3 with Amazon RDS, you will need to:

  • Remove the database service from the docker-compose.yml file.
  • Change the environment variables for the datagraft-portal service.

Instead of running the startup.sh script to add a Postgres administrator user, you need to set up an AWS Identity and Access Management (IAM) user for the DataGraft Platform.

database

Remove the database entry under services.

datagraft-portal

Remove the dependency on the Postgres service under links:

  • -database:database-host

Remove execution of the startup script under commands:

  • bash startup.sh

Remove the following environment variables that are used for Postgres:

  • DATABASE_URL - URL of the Postgres database
  • DATABASE_HOST - host of the Postgres database
  • DATABASE_PASSWORD - password of the admin user (defined in startup.sh)
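
After editing, it can help to confirm that nothing Postgres-specific is left in the file. The function below is a minimal sketch; `leftover_postgres_settings` is a hypothetical name, and the search pattern simply covers the link, the startup command and the three variables listed above.

```shell
#!/usr/bin/env bash
# Sketch only: leftover_postgres_settings is a hypothetical helper,
# not part of this repository.
# Prints (with line numbers) any line of the given Compose file that still
# mentions the Postgres-specific settings listed above.
# Returns 0 when the file is clean, 1 when leftovers were found,
# and 2 when the file cannot be read.
leftover_postgres_settings() {
  local file="$1"
  [ -r "$file" ] || return 2
  if grep -nE 'DATABASE_(URL|HOST|PASSWORD)|database-host|startup\.sh' "$file"; then
    return 1
  fi
  return 0
}
```

Run it as `leftover_postgres_settings docker-compose.yml`. Separately, `docker-compose config -q` checks that the edited file is still valid Compose syntax.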

Add the following environment variables used for AWS S3 and RDS:

  • AWS_RDS_DB_NAME - name of the database
  • AWS_RDS_DB_USERNAME - name of the database user
  • AWS_RDS_DB_PASSWORD - password for the database
  • AWS_RDS_DB_HOST - RDS host running the database
  • AWS_S3_BUCKET_NAME - S3 bucket name
  • AWS_S3_ACCESS_KEY_ID - S3 access key ID
  • AWS_S3_ACCESS_KEY_SECRET - S3 access key secret
  • AWS_S3_REGION - S3 region
  • SES_SMTP_USERNAME - SMTP username
  • SES_SMTP_PASSWORD - SMTP password

Questions or issues?

To report bugs, ask questions, or start discussions, please use the GitHub Issues feature.

Development Team

Core Team

Contributors

License

Available under the Eclipse Public License (v1.0).