Apache Griffin is a model-driven data quality solution for modern data systems. It provides a standard process to define data quality measures, execute them, and report results, as well as a unified dashboard across multiple data systems. You can access our home page here. You can access our wiki page here. You can access our JIRA issues page here.
- Install Docker and Docker Compose.
- Pull our pre-built Docker image and the Elasticsearch image.
docker pull bhlx3lyx7/svc_msr:0.1.6
docker pull bhlx3lyx7/elasticsearch
If you are in China, you can pull the images faster through mirror acceleration.
docker pull registry.docker-cn.com/bhlx3lyx7/svc_msr:0.1.6
docker pull registry.docker-cn.com/bhlx3lyx7/elasticsearch
- Increase vm.max_map_count on your local machine, which Elasticsearch requires.
sysctl -w vm.max_map_count=262144
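This setting does not survive a reboot. If you want it to persist, one common approach (a sketch assuming a Linux host that reads /etc/sysctl.conf, not a Griffin requirement) is:
# persist the setting across reboots
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p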
- Copy docker-compose-batch.yml to your work path.
- In your work path, start the Docker containers using Docker Compose, wait about one minute, and the Griffin service is ready.
docker-compose -f docker-compose-batch.yml up -d
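As an optional sanity check before trying the APIs, you can list the containers and watch the logs until the service has started; this is just standard Docker Compose usage, not a Griffin-specific step:
# verify the griffin and elasticsearch containers are up
docker-compose -f docker-compose-batch.yml ps
# follow the logs until the service has finished starting
docker-compose -f docker-compose-batch.yml logs -f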
- Now you can try the Griffin APIs using Postman after importing the JSON files.
Note that you need to change the environment variable BASE_PATH to <your local IP address>:38080.
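If you prefer the command line to Postman, a minimal smoke test is simply to hit the base path and confirm the service answers; the specific endpoints to call afterwards come from the imported JSON collections:
# replace <your local IP address> with the address you set for BASE_PATH
curl -i http://<your local IP address>:38080/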
More details about Griffin with Docker are available here.
- Install JDK (1.8 or later).
- Install MySQL.
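The service expects a quartz database (see spring.datasource.url in application.properties below). A minimal sketch of creating it, assuming a local root account, is:
# create the database referenced by spring.datasource.url ("quartz" comes from application.properties)
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS quartz DEFAULT CHARACTER SET utf8;"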
- Install npm (version 6.0.0+).
- Install Hadoop (2.6.0 or later); you can get some help here.
- Install Spark (version 1.6.x; Griffin does not currently support 2.0.x). If you want to install a Pseudo Distributed/Single Node Cluster, you can get some help here.
- Install Hive (version 1.2.1 or later); you can get some help here. You need to make sure that your Spark cluster can access your HiveContext.
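A common way to make the metastore reachable at the thrift URI configured later (hive.metastore.uris = thrift://<your IP>:9083) is to keep the Hive metastore service running; this is a sketch assuming a standard Hive installation:
# start the hive metastore service so Spark/HiveContext can reach thrift://<your IP>:9083
nohup hive --service metastore > metastore.log 2>&1 &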
- Install Livy; you can get some help here.
Griffin needs to schedule Spark jobs from the server, and we use Livy to submit the jobs.
Because of some issues with Livy and HiveContext, we need to download three files and put them into HDFS:
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
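A sketch of uploading these jars into HDFS, assuming <datanucleus path> is the same path you configure in sparkJob.properties below:
# upload the datanucleus jars referenced by sparkJob.jars_1..3
hdfs dfs -mkdir -p <datanucleus path>
hdfs dfs -put datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar <datanucleus path>/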
- Install Elasticsearch. Elasticsearch works as a metrics collector: Griffin produces metrics to it, and our default UI gets metrics from it; you can use your own approach as well.
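Before wiring Elasticsearch into Griffin, a quick check that it is reachable (assuming the default HTTP port 9200, the same one used in services.js below) is:
# elasticsearch should answer with its cluster and version info
curl http://<your IP>:9200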
- Modify the configuration for your environment.
You need to modify the following configuration files to make Griffin work well in your environment.
service/src/main/resources/application.properties
spring.datasource.url = jdbc:mysql://<your IP>:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>
hive.metastore.uris = thrift://<your IP>:9083
hive.metastore.dbname = <hive database name>    # default is "default"
service/src/main/resources/sparkJob.properties
sparkJob.file = hdfs://<griffin measure path>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin env path>/env.json
sparkJob.jars_1 = hdfs://<datanucleus path>/datanucleus-api-jdo-3.2.6.jar
sparkJob.jars_2 = hdfs://<datanucleus path>/datanucleus-core-3.2.10.jar
sparkJob.jars_3 = hdfs://<datanucleus path>/datanucleus-rdbms-3.2.9.jar
sparkJob.uri = http://<your IP>:8998/batches
ui/js/services/services.js
ES_SERVER = "http://<your IP>:9200"
Configure measure/measure-batch/src/main/resources/env.json for your environment, and put it into HDFS.
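A sketch of uploading env.json, assuming <griffin env path> matches the path used by sparkJob.args_1 above:
# put env.json where sparkJob.args_1 expects it
hdfs dfs -mkdir -p <griffin env path>
hdfs dfs -put measure/measure-batch/src/main/resources/env.json <griffin env path>/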
- Build the whole project and deploy. (npm must be installed; on macOS you can try 'brew install node'.)
mvn install
Create a directory in HDFS, and put our measure package into it.
cp /measure/target/measure-0.1.3-incubating-SNAPSHOT.jar /measure/target/griffin-measure.jar
hdfs dfs -put /measure/target/griffin-measure.jar <griffin measure path>/
After all the environment services have started, we can start our server.
java -jar service/target/service.jar
After a few seconds, we can visit the default UI of Griffin (by default the Spring Boot port is 8080).
http://<your IP>:8080
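If you want the service to keep running after you close the shell, one option (a deployment convenience, not a Griffin requirement) is to run it in the background and log to a file:
# run the service in the background and capture its output
nohup java -jar service/target/service.jar > service.log 2>&1 &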
- Follow the steps using the UI here.
Note: The front-end UI is still under development; currently you can only access some basic features.
See CONTRIBUTING.md for details on how to contribute code, documentation, etc.