This repo contains semi-automatic tools for installing Snowplow on the Google Cloud Platform. It automates tasks from a great tutorial by Simo Ahava: https://www.simoahava.com/analytics/install-snowplow-on-the-google-cloud-platform/
You should read Simo's article first to get an overview of the whole Snowplow GCP ecosystem. All scripts have been tested on Debian running in the Windows 10 Linux Subsystem.
First, complete a few steps from the tutorial (a rough gcloud sketch of these steps follows the list):
- Prepare a Google Cloud project and enable billing. Note the project-id.
- Enable the Pub/Sub, Compute Engine, and Dataflow APIs.
- Create a service account, download auth.json, and note the service-account-email.
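If you prefer to do this from the command line, the same preparation can be sketched with gcloud. The project id, service account name, and role below are placeholders (roles/editor is only an example), not values this repo requires:

```bash
# Enable the APIs used by the Snowplow pipeline.
gcloud services enable pubsub.googleapis.com compute.googleapis.com dataflow.googleapis.com

# Create the service account and download its key as auth.json.
gcloud iam service-accounts create snowplow-setup --display-name "Snowplow setup"
gcloud iam service-accounts keys create auth.json \
  --iam-account snowplow-setup@my-snowplow-project.iam.gserviceaccount.com

# Grant a role with sufficient permissions (adjust to your own policy).
gcloud projects add-iam-policy-binding my-snowplow-project \
  --member serviceAccount:snowplow-setup@my-snowplow-project.iam.gserviceaccount.com \
  --role roles/editor
```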
The following steps are optional. You can skip them if you already use the gcloud command-line tools.
- Install gcloud (https://cloud.google.com/sdk/docs/#deb)
- Init the gcloud command line. Run
gcloud init
- Init the BigQuery command line. Run
bq init
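If you want gcloud to act as the service account you created (rather than your own user account), a minimal sketch follows; the key file and project id are placeholders:

```bash
# Assumption: auth.json is the key downloaded earlier; replace the project id with your own.
gcloud auth activate-service-account --key-file=auth.json
gcloud config set project my-snowplow-project
```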
Get the template from git:
git clone https://github.com/etnetera-activate/snowplow-gcp-template.git
cd snowplow-gcp-template
Run ./install.sh to install a package for generating UUIDs and the jq JSON parser.
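If you would rather install the dependencies yourself, the equivalent on Debian is presumably something like the following (the exact package names are an assumption; check ./install.sh for the authoritative list):

```bash
# Assumed Debian package names: uuid-runtime provides uuidgen, jq is the JSON parser.
sudo apt-get update
sudo apt-get install -y uuid-runtime jq
```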
Edit ./gcloud-config-mustr.sh: replace PROJECTID and SERVICEACCOUNT, and change ZONE and REGION if you wish. Save the result as ./gcloud-config.sh.
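A hypothetical sketch of what a filled-in ./gcloud-config.sh might contain; the actual variable names and structure come from ./gcloud-config-mustr.sh, so follow that template rather than this guess:

```bash
# Illustrative values only; substitute your own project, service account, and location.
PROJECTID="my-snowplow-project"
SERVICEACCOUNT="snowplow-setup@my-snowplow-project.iam.gserviceaccount.com"
REGION="europe-west1"
ZONE="europe-west1-b"
```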
Then run ./gcloud-init-project.sh. This script will (see the sketch after this list for a rough idea of the gcloud commands involved):
- Prepare some config files and a start/stop script for the ETL.
- Create all Pub/Sub topics and subscriptions.
- Create a storage bucket and copy the config files into it.
- Create a BigQuery dataset.
- Create an instance template for the collector and create the collector group.
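For orientation only, the kinds of commands the script runs look roughly like this; every resource name below is an invented placeholder, not the name the script actually uses:

```bash
# Pub/Sub topics and subscriptions (names are placeholders).
gcloud pubsub topics create good bad enriched-good
gcloud pubsub subscriptions create good-sub --topic good

# Storage bucket for the config files.
gsutil mb -l europe-west1 gs://my-snowplow-config-bucket/
gsutil cp ./config/* gs://my-snowplow-config-bucket/

# BigQuery dataset.
bq mk --dataset my-snowplow-project:snowplow

# Collector instance template and managed instance group.
gcloud compute instance-templates create snowplow-collector-template \
  --machine-type n1-standard-1
gcloud compute instance-groups managed create snowplow-collector-group \
  --template snowplow-collector-template --size 1 --zone europe-west1-b
```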
After the script finishes, you should manually configure the firewall, the load balancer, and the JavaScript tracker: https://www.simoahava.com/analytics/install-snowplow-on-the-google-cloud-platform/#step-3-create-a-load-balancer
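Once the collector is reachable (directly or through the load balancer), a quick smoke test is to query its health endpoint; the address below is a placeholder, and the /health path assumes the standard Snowplow Scala Stream Collector:

```bash
# Replace the address with your load balancer IP or a collector instance IP.
curl -i http://203.0.113.10/health
# A healthy Scala Stream Collector answers 200 OK.
```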
The ETL process is quite expensive because it utilizes Google Dataflow. You should start it only for a short time and then kill it. Simply use ./start_etl.sh and ./stop_etl.sh for this purpose.
./start_etl.sh
creates a new virtual machine which runs the BigQuery mutator and starts two Dataflow jobs (one for enrichment and a second for BigQuery inserts).
After you run it, you can check the running instances using:
gcloud compute instances list
After some time, you should see:
- snowplow-collector-xxx instances (one or more, depending on the group's autoscale settings)
- snowplow-etl
- some machines started for the Beam enrich Dataflow job
- some machines started for the BigQuery load Dataflow job
You can also check the Dataflow jobs using:
gcloud dataflow jobs list
There should be two running jobs.
./stop_etl.sh
stops both Dataflow jobs and deletes the ETL instance. After some time, only the collector machines will remain.
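If anything is left behind, for example a Dataflow job that did not stop, you can also cancel it manually; the job id and region below are placeholders:

```bash
# List active jobs, then cancel by id; substitute your own job id and region.
gcloud dataflow jobs list --status=active
gcloud dataflow jobs cancel 2018-01-01_00_00_00-1234567890123456789 --region=europe-west1
```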