This project gathers statistics from Wikimedia EventStreams using Kafka Streams and exposes them through a REST API (backed by Kafka Streams interactive queries).
The only dependency for running this project is Docker Compose.
Start the application components by running the following commands:
NOTE: Make sure the Docker daemon is running.
# start the local Kafka cluster (references 'scripts/' for topic creation)
$ docker-compose up
$ docker-compose exec kafka bash
$ kafka-configs --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name WikiEvents --add-config max.message.bytes=100485880
$ cd ./wikipedia-statistics/src/main/python
# activate virtual env
$ source ./venv/Scripts/activate
# install dependencies (only needed once)
$ pip install -r requirements.txt
$ export PYTHONPATH="$(pwd)/venv/Lib/site-packages"
# start the Wikimedia EventStreams Kafka producer
$ python3 ./wikiEventProducer.py --bootstrap_server localhost:29092 --topic_name WikiEvents --events_to_produce 10
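EventStreams delivers its events over Server-Sent Events (SSE), so a producer of this kind has to turn the raw stream into JSON objects before sending them to Kafka. The following is a minimal, standard-library-only sketch of that parsing step. It is not the code of wikiEventProducer.py: the function name `parse_sse_events` and the sample payloads are made up for illustration, and the Kafka send itself is left out.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event found in `lines`.

    Hypothetical helper for illustration; not taken from wikiEventProducer.py.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # An event may spread its payload over several `data:` lines.
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE drops one leading space after the colon
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            payload = "\n".join(data_parts)
            data_parts = []
            try:
                yield json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip payloads that are not valid JSON
        # Other fields (`event:`, `id:`) and comment lines (':...') are ignored.


# Made-up sample stream; real events carry many more fields.
SAMPLE_STREAM = [
    "event: message",
    'data: {"type": "new", "wiki": "enwiki", "user": "Alice", "bot": false}',
    "",
    ": keep-alive comment",
    "data: not-json",
    "",
    'data: {"type": "edit", "wiki": "dewiki", "user": "Bot42", "bot": true}',
    "",
]

if __name__ == "__main__":
    for event in parse_sse_events(SAMPLE_STREAM):
        # In a real producer, this is where the event would be sent to Kafka.
        print(event["wiki"], event["type"], event["user"])
```

Each yielded dictionary could then be serialized and sent to the WikiEvents topic with whichever Kafka client requirements.txt installs.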
NOTE: On Windows, use gradlew.bat. On Linux, use gradlew.
Now, to run the Kafka Streams application, run:
$ cd ./wikipedia-statistics
# build project
$ ./gradlew build
# run project
$ ./gradlew run --info
NOTE: In the URLs, 'month' can be any of {hour, week, month, year}. Likewise, 'English' can be replaced by any language.
- curl localhost:7000/api.wikiStats/month/per-language/countPagesCreated
- curl localhost:7000/api.wikiStats/month/per-userType/countPagesCreated
- curl localhost:7000/api.wikiStats/month/per-language/countPagesModified
- curl localhost:7000/api.wikiStats/month/per-userType/countPagesModified
- curl localhost:7000/api.wikiStats/month/per-language/English/mostActiveUsers
- curl localhost:7000/api.wikiStats/month/per-userType/English/mostActiveUsers
- curl localhost:7000/api.wikiStats/month/per-language/English/mostActivePages
- curl localhost:7000/api.wikiStats/month/per-userType/English/mostActivePages
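The endpoints above all follow the same pattern, so they can also be generated programmatically. The sketch below builds the URLs from the pieces listed in the note and fetches one with the Python standard library. The helper names `stats_url` and `fetch_stats` are hypothetical, and because the response format is not documented here, the body is returned as raw text.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:7000/api.wikiStats"
WINDOWS = ("hour", "week", "month", "year")
GROUPINGS = ("per-language", "per-userType")


def stats_url(window: str, grouping: str, metric: str,
              key: Optional[str] = None) -> str:
    """Build an endpoint URL, e.g. .../month/per-language/countPagesCreated.

    `key` is the optional path segment such as 'English' in the examples above.
    """
    if window not in WINDOWS:
        raise ValueError(f"window must be one of {WINDOWS}, got {window!r}")
    if grouping not in GROUPINGS:
        raise ValueError(f"grouping must be one of {GROUPINGS}, got {grouping!r}")
    parts = [BASE_URL, window, grouping]
    if key is not None:
        parts.append(quote(key, safe=""))  # escape spaces and slashes in the key
    parts.append(metric)
    return "/".join(parts)


def fetch_stats(url: str, timeout: float = 5.0) -> str:
    """Return the raw response body; requires the application to be running."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Print the page-creation endpoints for every time window (no network access).
    for window in WINDOWS:
        print(stats_url(window, "per-language", "countPagesCreated"))
```

With the application running, `fetch_stats(stats_url("hour", "per-language", "mostActiveUsers", key="English"))` would issue the same request as the corresponding curl command.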
These have been a huge help for me:
- The book Mastering Kafka Streams and ksqlDB by Mitch Seymour.
- The Medium article Introduction to Apache Kafka with Wikipedia’s EventStreams service.