Overview

DataHem is a serverless, real-time, end-to-end ML pipeline built entirely on Google Cloud Platform services: App Engine, Pub/Sub, Dataflow, BigQuery and Cloud ML.

Benefits

When building ML/Data products, your most valuable asset is your data. Hence, the purpose of DataHem is to give you:

full control and ownership of your data
unsampled data
data in real time
the ability to replay/reprocess your data unlimited times
data synergies -> collect once and use for multiple purposes (reporting, analytics and building data/ML products)
low cost of operations and maintenance
scalability
data as a stream and at rest
activation of data
ability to delete data on a row by row basis

Target architecture

Use cases

1. Digital Analytics

The first use case is to leverage your implementation of Google Analytics / Measurement Protocol. Google Analytics is awesome, but it has some limitations worth addressing in order to take reporting, analytics and machine learning to the next level. By adding a custom task to your Google Analytics tracker, DataHem eliminates many of the limitations of both the free and the premium version of Google Analytics and gives you:

Unsampled data
Real-time data
Unlimited custom dimensions and metrics
Unlimited data volume
Enriched data as a stream
Unlimited reprocessing of data
No licensing fees (open source)
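
To make the idea concrete: a Measurement Protocol hit is a URL-encoded query string, so once the custom task has forwarded a copy of it, the receiving side can decode it into key/value pairs. The sketch below is only an illustration, not DataHem's actual code. The function names and the sample payload are assumptions made for this example; the parameter names (v, t, tid, cid, dp, cdN, cmN) are standard Measurement Protocol fields.

```python
import re
from typing import Dict
from urllib.parse import parse_qsl

# Hypothetical sample hit; cd1/cm1 are a custom dimension and a custom metric.
EXAMPLE_PAYLOAD = "v=1&t=pageview&tid=UA-XXXXX-Y&cid=555&dp=%2Fhome&cd1=member&cm1=3"


def parse_hit(payload: str) -> Dict[str, str]:
    """Decode a URL-encoded hit payload into a flat dict of parameters."""
    return dict(parse_qsl(payload, keep_blank_values=True))


def custom_dimensions(hit: Dict[str, str]) -> Dict[str, str]:
    """Return only the custom dimension parameters (cd1, cd2, ...)."""
    return {k: v for k, v in hit.items() if re.fullmatch(r"cd\d+", k)}


def custom_metrics(hit: Dict[str, str]) -> Dict[str, str]:
    """Return only the custom metric parameters (cm1, cm2, ...)."""
    return {k: v for k, v in hit.items() if re.fullmatch(r"cm\d+", k)}


hit = parse_hit(EXAMPLE_PAYLOAD)
print(hit["dp"])
print(custom_dimensions(hit))
print(custom_metrics(hit))
```

Because the raw payload is kept as-is, any number of custom dimensions and metrics can be extracted later without changing the tracker.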

License

DataHem is licensed under AGPL 3.0 or later.

DataHem ecosystem

The architecture of DataHem consists of loosely coupled parts, so that individual parts can be replaced or extended in the future.

tracker: Sends data to the collector; currently supports the Google Analytics JavaScript tracker
collector: Collects data sent from trackers and publishes it to Pub/Sub; currently runs on Google App Engine Standard (Java)
processor: Processes bounded and unbounded data and writes to Pub/Sub and BigQuery; currently uses Google Dataflow (Apache Beam) and supports processing of Google Analytics hits and AWS Kinesis events
serializer: Serializes structured data; currently uses protocol buffers
infrastructor: Infrastructure as code to easily set up the APIs and services required; currently uses Google Deployment Manager
predictor (backlog): Predictions made on streaming data
pseudonymizor (backlog): Pseudonymizing personal and/or sensitive data
ruler (backlog): Processing rules for personal data
activator (backlog): Serving predictions via REST/gRPC
orchestrator (backlog): Workflow management DAGs using Google Cloud Composer
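
The loose coupling between collector and processor can be sketched as follows. This is a minimal in-memory illustration, not the real implementation: the InMemoryTopics class stands in for Pub/Sub, a plain list stands in for BigQuery, and the topic names "raw-hits" and "enriched-hits" are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List
from urllib.parse import parse_qsl


class InMemoryTopics:
    """Hypothetical stand-in for Pub/Sub: delivers each message to every subscriber of a topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> None:
        for handler in self._subscribers[topic]:
            handler(message)


def collector(bus: InMemoryTopics, payload: str) -> None:
    """Collector role: accept a raw hit and publish it unchanged."""
    bus.publish("raw-hits", payload)


def make_processor(bus: InMemoryTopics, warehouse: List[dict]) -> Callable[[str], None]:
    """Processor role: parse a raw hit, store it at rest, and re-publish it as a stream."""

    def process(payload: str) -> None:
        row = dict(parse_qsl(payload))
        warehouse.append(row)              # stands in for a BigQuery write
        bus.publish("enriched-hits", row)  # stands in for a Pub/Sub write

    return process


bus = InMemoryTopics()
warehouse: List[dict] = []
stream: List[dict] = []

bus.subscribe("raw-hits", make_processor(bus, warehouse))
bus.subscribe("enriched-hits", stream.append)

collector(bus, "v=1&t=pageview&dp=%2Fhome")
print(warehouse)
print(stream)
```

Since each part only knows the topic it reads from or writes to, a part can be swapped out (for example, a different collector runtime) without touching the others.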

Setup

Follow the instructions in the wiki to set up the various parts of DataHem.

Background

DataHem was started in June 2017 by robertsahlin / ML-engineer. It was open sourced, officially brought under MatHem's mhlabs GitHub account, and announced in May 2018.

The name DataHem is a play on words that echoes MatHem, the Swedish online grocery store where DataHem is developed. "Data" = "data". "Hem" = the Swedish word for "home".
