/youtube-divolte-kafka-druid-superset

A proof of concept about collecting real-time clickstream data using JavaScript, Divolte Collector, Apache Kafka, Kafka Streams, Apache Druid and Apache Superset.

Real-time Clickstream analysis

At the end of the YouTube video linked here, we compare our results with Microsoft Clarity and Google Analytics. The comparison is just for fun, since those platforms are complete products that large companies have developed over many years.

Youtube video

Demo

(this demo may not be available after some time, due to cloud infrastructure costs)

You can visit the website as a client, then open the Apache Superset dashboard to see the results in real time.

Apache Superset dashboard credentials:

username: admin
password: admin

Architecture Diagram

(image: architecture diagram)

Dashboard

(image: dashboard screenshot)

Technologies Used

  • A tool developed with Selenium and Python (used to simulate user visits to the website)
  • JavaScript (used with the ipstack tool to collect user information when visiting the website)
  • Divolte Collector (used as a server to collect clickstream data into Apache Kafka)
  • Apache Avro (used inside Divolte Collector as the schema for the payload)
  • Apache Kafka (used as a publish/subscribe system)
  • Kafka Manager (CMAK) (used as a management dashboard for the Apache Kafka cluster)
  • Kafka Streams (used to convert the Avro payload to JSON)
  • Apache Druid (used as a high-performance real-time analytics database)
  • Apache Superset (used as a dashboard for data visualization)
  • Mapbox (used in Apache Superset to render maps)
  • Docker and Swarm (used for containerization and deployment)
  • Ansible (used to facilitate deployment on remote servers)
  • DigitalOcean (used as the cloud infrastructure)

Youtube videos I made on clickstream data collection

Requirements

  • Docker
  • Ansible (if you want to automate the deployment on a remote server)

Getting Started

Clone repository

git clone https://github.com/soufianeodf/youtube-divolte-kafka-druid-superset.git

cd youtube-divolte-kafka-druid-superset

Website

  • Add the Microsoft Clarity and Google Analytics tags to the header of index.html.
  • In index.html, replace the divolte-ip-address value with the IP address or DNS name of your Divolte server.
  • Optionally, change the nginx config file.
  • Optionally, adapt the payload sent from main.js.

Divolte Collector

You can modify the Divolte Collector config files and adapt them to your needs.

Zookeeper, Apache Kafka and Kafka Manager

You can control all config variables of Zookeeper, Apache Kafka and Kafka Manager from docker-compose.yml.

Kafka Streams

You can modify the Kafka Streams variables in the application.properties file.

Make sure that the Avro schema file is the same as the one used by the Divolte Collector server.

Don't forget to regenerate the Java .jar after you make any change.
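
One way to check that the two copies of the Avro schema have not drifted apart is to compare them as parsed JSON rather than as raw text, so that whitespace and key order do not matter. The sketch below is a minimal illustration; the function names and the file paths in the usage comment are hypothetical, not the repository's actual layout.

```python
import json
from pathlib import Path


def load_schema(path):
    """Parse an Avro schema file (.avsc), which is plain JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def schemas_match(path_a, path_b):
    """Return True when both files describe the same JSON structure.

    Dicts compare without regard to key order; lists (such as the
    "fields" array) compare in order, which matters for Avro records.
    """
    return load_schema(path_a) == load_schema(path_b)


# Hypothetical usage; adjust both paths to your checkout:
# schemas_match("divolte/schema.avsc", "kafka-streams/schema.avsc")
```

This only detects differences between the files; it does not validate that either file is a well-formed Avro schema.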

Apache Druid

You can modify the Apache Druid config file if you want.

After running Apache Druid, to filter out payloads whose country value is null, we use the following filter:

{
   "type":"not",
   "field":{
      "type":"selector",
      "dimension":"country",
      "value":null
   }
}
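
To make the semantics of this filter concrete, the following Python sketch builds the same structure and applies a small local emulation of the "selector" and "not" filter types to a few rows. The emulation is an illustration only, not Druid's implementation, and the sample rows and function names are made up for the example.

```python
import json


def not_null_filter(dimension):
    """Build the Druid filter shown above for an arbitrary dimension."""
    return {
        "type": "not",
        "field": {"type": "selector", "dimension": dimension, "value": None},
    }


def matches(filter_spec, row):
    """Local emulation of the two filter types used here (not Druid code)."""
    if filter_spec["type"] == "selector":
        # A missing key is treated like a null value.
        return row.get(filter_spec["dimension"]) == filter_spec["value"]
    if filter_spec["type"] == "not":
        return not matches(filter_spec["field"], row)
    raise ValueError(f"unsupported filter type: {filter_spec['type']}")


country_filter = not_null_filter("country")

# Made-up sample rows, for illustration only.
rows = [{"country": "Morocco"}, {"country": None}, {"browser": "Firefox"}]
kept = [row for row in rows if matches(country_filter, row)]

print(json.dumps(country_filter, indent=3))
print(kept)
```

Only the row that carries a non-null country survives; the rows with a null or absent country are dropped.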

Apache Superset

superset.sh sets the username and password of the Apache Superset dashboard, among other things. Make sure you execute it after Apache Superset is up and running.

Apache Superset uses Mapbox under the hood to render maps, so you need to set the Mapbox key in the config file:

MAPBOX_API_KEY = "your_mapbox_token"
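
To avoid committing the token to the repository, the config file could read it from an environment variable instead. This is a sketch under an assumption: the environment variable name MAPBOX_API_KEY is chosen here to mirror the setting name, and it would have to be passed to the Superset container (for example through docker-compose.yml).

```python
import os

# Assumption: the token is exported as the MAPBOX_API_KEY environment
# variable; fall back to an empty string when it is not set.
MAPBOX_API_KEY = os.environ.get("MAPBOX_API_KEY", "")
```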

After running Apache Superset, connect it to Apache Druid with the following SQLAlchemy URI:

druid://<User>:<password>@<Host>:<Port-default-8888>/druid/v2/sql
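
If the user name or password contains characters such as @ or /, they need to be percent-encoded before being placed in the URI. The helper below is a minimal sketch of how the URI above can be assembled; the function name, the host name druid-router, and the sample credentials are hypothetical.

```python
from urllib.parse import quote


def druid_sqlalchemy_uri(user, password, host, port=8888):
    """Build the druid:// URI, percent-encoding the credentials."""
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"druid://{credentials}@{host}:{port}/druid/v2/sql"


# Made-up credentials and host, for illustration only.
print(druid_sqlalchemy_uri("admin", "p@ss/word", "druid-router"))
```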

Docker

You need to build your images and push them to your Docker Hub repository, because Docker Swarm assumes that the images are already built and exist in a Docker registry.

Adapt docker-compose.yml to your needs, then build and push the images to your Docker Hub repository as below:

docker-compose build
docker-compose push

Deploy on DigitalOcean with Ansible

The Ansible project is heavily inspired by pg3io/ansible-do-swarm; shout-out to the author.

The Ansible playbook performs the following tasks:

  • Create droplets in DigitalOcean.
  • Install Docker on created droplets.
  • Create a Docker Swarm cluster with a single manager.
  • Copy docker-compose.yml and superset.sh files to manager node.
  • Run Docker Swarm.
  • Execute superset.sh.

Playbook Variables

All variables of the playbook can be found in vars.yml


  • do_token: DigitalOcean API token.
  • droplets: list of droplets to deploy; the first in the list becomes the manager.
  • do_region: datacenter location. Listing: curl -X GET --silent "https://api.digitalocean.com/v2/regions?per_page=999" -H "Authorization: Bearer " |jq -r '{name: .regions[].name, regions_id: .regions[].slug}'
  • do_size: droplet size. Listing: curl -X GET --silent "https://api.digitalocean.com/v2/sizes?per_page=999" -H "Authorization: Bearer " |jq -r '.sizes[] .slug' | sort
  • ssh_key_ids: register an SSH key in your DigitalOcean account, then obtain its ID with the following command: curl -X GET -H 'Content-Type: application/json' -H 'Authorization: Bearer '$DOTOKEN "https://api.digitalocean.com/v2/account/keys" 2>/dev/null | jq '.ssh_keys[] | {name: .name, id: .id}'
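
For readers who prefer Python to curl and jq, the same three listings can be sketched with the standard library. This is an illustration, not part of the playbook: the helper names are hypothetical, the endpoints and response keys are the ones used in the curl commands above, and the token is assumed to be in the DOTOKEN environment variable.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com/v2"


def fetch(path, token):
    """GET a DigitalOcean API v2 path and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/{path}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def region_slugs(payload):
    """Pair each region name with its slug, as the jq listing does."""
    return [(region["name"], region["slug"]) for region in payload["regions"]]


def size_slugs(payload):
    """Return the droplet size slugs in sorted order."""
    return sorted(size["slug"] for size in payload["sizes"])


def ssh_key_ids(payload):
    """Pair each registered SSH key name with its numeric id."""
    return [(key["name"], key["id"]) for key in payload["ssh_keys"]]


# Usage (requires network access and a valid token in DOTOKEN):
# import os
# token = os.environ["DOTOKEN"]
# print(region_slugs(fetch("regions?per_page=999", token)))
# print(size_slugs(fetch("sizes?per_page=999", token)))
# print(ssh_key_ids(fetch("account/keys", token)))
```

Keeping the parsing functions separate from the network call makes them easy to try on a saved API response.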

Run

cd ansible/

ansible-playbook do-swarm.yml -e do_token="<DO TOKEN>"

Troubleshooting

Apache Superset

Issue: Unexpected Exception: name 'basestring' is not defined when invoking ansible2

Solution: run pip uninstall dopy, then pip3 install git+https://github.com/eodgooch/dopy@0.4.0#egg=dopy

Issue: The CSRF session token is missing

Solution: set WTF_CSRF_ENABLED = False in the Apache Superset config file.

Website visits simulation with Selenium

In the video, I used a Selenium tool to simulate visits to the website from different browsers, operating systems and countries, as described in the image below, to check whether the clickstream solution we built captures those hits accurately:

disclosure

The Selenium tool that simulates website user visits is private at the moment because it is still in development; it will be made public as soon as it is completed.

License

Licensed under the MIT License.