
Sparkify is a music streaming service similar to Spotify. Every user activity in the Sparkify application is logged and sent to a Kafka cluster. To improve the business, the data team collects this data into a Big Data Platform (BDP) for further processing, analysis, and extraction of insights that inform follow-up actions.

To use Sparkify, a user needs to register an account at either the free or the paid level. A registered user can upgrade from free to paid or downgrade from paid to free. A user can also leave the platform by cancelling the account.
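The account lifecycle described above can be sketched as a small transition table. This is only an illustration: the action names (`upgrade`, `downgrade`, `cancel`) and the `cancelled` state are hypothetical labels, not the values Sparkify actually logs.

```python
# Allowed account-level transitions.
# NOTE: action and state names are illustrative assumptions, not logged values.
TRANSITIONS = {
    ("free", "upgrade"): "paid",
    ("paid", "downgrade"): "free",
    ("free", "cancel"): "cancelled",
    ("paid", "cancel"): "cancelled",
}


def apply_action(level, action):
    """Return the new account level, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(level, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} from level {level!r}") from None


def final_level(actions, start="free"):
    """Apply a sequence of actions to an account that starts at `start`."""
    level = start
    for action in actions:
        level = apply_action(level, action)
    return level


print(final_level(["upgrade", "downgrade", "upgrade", "cancel"]))  # cancelled
```

A cancelled account has no outgoing transitions in this sketch, which is one simple way to mark a user as churned.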

  • The target data pipeline looks as below:

  • Business questions:

    1. Which gender is more active?
    2. Which level is more active (free or paid)?
    3. Which factors (based on the collected data) make users stop subscribing to the service (churn)?
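As a rough illustration of the first two questions, activity can be approximated by counting events per group. The sketch below assumes each event is a dict with `gender` and `level` keys; the exact column names and the sample rows are assumptions, and "active" could equally be defined by distinct users or listening time.

```python
from collections import Counter


def activity_by(events, field):
    """Count events per value of `field`, skipping events where it is missing."""
    return Counter(e[field] for e in events if e.get(field) is not None)


# Made-up sample rows for illustration only; not taken from the dataset.
events = [
    {"gender": "F", "level": "paid"},
    {"gender": "M", "level": "free"},
    {"gender": "F", "level": "free"},
    {"gender": None, "level": "free"},  # e.g. an event from a logged-out session
]

print(activity_by(events, "gender").most_common())
print(activity_by(events, "level").most_common())
```

On the platform itself the same aggregation would more likely be a `GROUP BY` query in Impala over the raw table.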

Note: To connect Power BI to Hadoop (Impala specifically), refer to this guide.


  • Public dataset: Million Song Dataset
  • Contains 18 columns with information about customers (gender, name, etc.) and API events (login, playing the next song, etc.)
  • Experiment period: 2018-10-01 to 2018-12-01
  • Kafka message example:
        "auth":"Logged In",
        "location":"New York-Newark-Jersey City, NY-NJ-PA",
        "userAgent":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
        "song":"You\\'re Not Alone",
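A consumer would typically decode each message value before using fields such as `auth` or `location`. The sketch below assumes the value is a UTF-8 encoded JSON object; the helper names `parse_event` and `split_location` are hypothetical, and the sample message contains only two of the fields shown above.

```python
import json


def parse_event(raw):
    """Decode one message value (bytes or str) into a dict."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


def split_location(location):
    """Split 'Metro Area, ST-ST' into (metro area, list of state codes)."""
    metro, sep, states = location.rpartition(", ")
    if not sep:  # no ", " separator: treat the whole string as the metro area
        return location, []
    return metro, states.split("-")


# Sample built from two fields of the example message above.
sample = json.dumps(
    {"auth": "Logged In", "location": "New York-Newark-Jersey City, NY-NJ-PA"}
).encode("utf-8")

event = parse_event(sample)
print(event["auth"])                      # Logged In
print(split_location(event["location"]))  # ('New York-Newark-Jersey City', ['NY', 'NJ', 'PA'])
```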

BDP access

Quick commands

Run Kafka producer

git clone
cd bdp-demo
python --time_interval 2
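The producer script itself is not shown here, so the following is only a guess at what a `--time_interval` option might control: replaying logged events one by one with a fixed pause between sends. The function name `replay`, the injected `send` callable, and the assumption that the interval is in seconds are all hypothetical.

```python
import json
import time


def replay(events, send, time_interval=2.0, sleep=time.sleep):
    """Send each event as UTF-8 JSON bytes, pausing `time_interval` seconds after each send.

    `send` is any callable that accepts the encoded message, so the loop can be
    tried without a running Kafka cluster.
    """
    sent = 0
    for event in events:
        send(json.dumps(event).encode("utf-8"))
        sent += 1
        sleep(time_interval)
    return sent


# Dry run: collect messages in a list instead of sending them to Kafka.
outbox = []
count = replay([{"auth": "Logged In"}], outbox.append, time_interval=0)
print(count, outbox)
```

With the kafka-python client, `send` could be something like `lambda value: producer.send(topic, value)`; whether the demo uses that client is not stated in this README.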

Run Spark streaming

git clone
cd bdp-demo

Clear topic

--zookeeper --delete --topic streaming.itbi.demo.music_service

Impala refresh table

impala-shell -i -k --ssl -q "REFRESH bdp_ap_it.music_service_raw"

Drop hive table

beeline -n 'hive' --verbose=true \
-u "jdbc:hive2://;principal=hive/_HOST@$BDP_KERBEROS_REALM;ssl=true;sslTrustStore=${BDP_TRUSTSTORE_PATH};trustStorePassword=${BDP_TRUSTSTORE_PASSWORD}" \
-e "drop table bdp_ap_it.music_service_raw"
