WikiMedia-Events-KafkaStreams-App

REST API that shows statistics of WikiMedia using Kafka Streams

Primary language: Java

WikiMedia Events Statistics Application

This application gathers statistics from WikiMedia EventStreams using Kafka Streams and exposes them via a REST API (using Kafka Streams Interactive Queries).

Kafka Pipeline FlowChart

(Figure: kafkaFlowChart)

Running Locally

The only dependency for running this project is Docker Compose.

Start the application components by running the following commands:

Kafka Cluster

NOTE: Make sure the Docker daemon is running.

# start the local Kafka cluster (references 'scripts/' for topic creation)
$ docker-compose up

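# in a second terminal: open a shell inside the kafka container
$ docker-compose exec kafka bash
# inside the container: raise the maximum message size allowed on the WikiEvents topic
$ kafka-configs --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name WikiEvents --add-config max.message.bytes=100485880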

Kafka Producer (Python)

$ cd ./wikipedia-statistics/src/main/python
# activate virtual env
$ source ./venv/Scripts/activate
# install the dependencies (first run only)
$ pip install -r requirements.txt
$ export PYTHONPATH="$(pwd)/venv/Lib/site-packages"
# start the WikiMedia EventStreams Kafka producer
$ python3 ./wikiEventProducer.py --bootstrap_server localhost:29092 --topic_name WikiEvents --events_to_produce 10

Kafka-Streams Client (Java)

NOTE: On Windows use gradlew.bat; on Linux use gradlew.

To run the Kafka Streams application:

$ cd ./wikipedia-statistics

# build project
$ ./gradlew build

# run project
$ ./gradlew run --info

Query the API

NOTE: In the URLs below, 'month' can be replaced by any of {hour, week, month, year}. Likewise, 'English' can be replaced by any language.

countPagesCreated

  • curl localhost:7000/api.wikiStats/month/per-language/countPagesCreated
  • curl localhost:7000/api.wikiStats/month/per-userType/countPagesCreated
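
To make the per-language variant of countPagesCreated concrete, here is a minimal plain-Java sketch of the kind of aggregation it reports. It is not the application's Kafka Streams topology: the Event class, its fields, and the use of "new" to mark a page creation are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class PagesCreatedCount {

    /** Hypothetical, simplified event: only the fields this sketch needs. */
    static final class Event {
        final String language;
        final String type; // assumption: "new" marks a page creation

        Event(String language, String type) {
            this.language = language;
            this.type = type;
        }
    }

    /** Counts page-creation events per language, sorted by language name. */
    static Map<String, Long> countCreatedPerLanguage(List<Event> events) {
        return events.stream()
                .filter(e -> "new".equals(e.type))
                .collect(Collectors.groupingBy(e -> e.language, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("English", "new"),
                new Event("English", "edit"),
                new Event("German", "new"),
                new Event("English", "new"));
        System.out.println(countCreatedPerLanguage(events)); // {English=2, German=1}
    }
}
```

In the running application the equivalent counts would be maintained continuously per time window rather than computed over an in-memory list.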

countPagesModified

  • curl localhost:7000/api.wikiStats/month/per-language/countPagesModified
  • curl localhost:7000/api.wikiStats/month/per-userType/countPagesModified

mostActiveUsers

  • curl localhost:7000/api.wikiStats/month/per-language/English/mostActiveUsers
  • curl localhost:7000/api.wikiStats/month/per-userType/English/mostActiveUsers
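
A "most active users" statistic for one language can be thought of as a count per user followed by a ranking. The plain-Java sketch below shows that idea only; the Edit class, the tie-breaking by user name, and the limit parameter are assumptions for this example, not the application's actual implementation or response format.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MostActiveUsersSketch {

    /** Hypothetical, simplified event: one edit by one user in one language. */
    static final class Edit {
        final String language;
        final String user;

        Edit(String language, String user) {
            this.language = language;
            this.user = user;
        }
    }

    /** Returns up to 'limit' users of the given language, most edits first, ties by name. */
    static List<Map.Entry<String, Long>> topUsers(List<Edit> edits, String language, int limit) {
        Map<String, Long> editsPerUser = edits.stream()
                .filter(e -> language.equals(e.language))
                .collect(Collectors.groupingBy(e -> e.user, Collectors.counting()));
        return editsPerUser.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Edit> edits = Arrays.asList(
                new Edit("English", "alice"),
                new Edit("English", "bob"),
                new Edit("English", "alice"),
                new Edit("German", "carol"));
        System.out.println(topUsers(edits, "English", 2)); // [alice=2, bob=1]
    }
}
```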

mostActivePages

  • curl localhost:7000/api.wikiStats/month/per-language/English/mostActivePages
  • curl localhost:7000/api.wikiStats/month/per-userType/English/mostActivePages

Useful Resources

These have been a huge help for me: