🚀 Visualize summaries of the data stream(s) and develop an ML model to detect anomalies in the data set 🚀
By Mohamed Aharrat https://github.com/medaharrat
https://github.com/medaharrat/anomalies-detection
http://localhost:8086/signin (InfluxDB - user: admin, password: admin123)
http://localhost:3000 (Grafana - there are no dashboards)
http://localhost:8080 (Chronograf - there are no dashboards)
http://localhost:9092 (Kafka - no UI)
http://localhost:8125 (Telegraf - no UI)
http://localhost:8888 (PySpark - no UI)
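To check which of the ports above actually accept connections, a minimal probe using only the Python standard library can help (the host/port list mirrors the services above and assumes the default docker-compose setup):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports from the service list above (adjust to your setup).
# Note: Telegraf's statsd input is UDP, so a TCP probe may report it down.
services = {
    "InfluxDB": 8086,
    "Grafana": 3000,
    "Chronograf": 8080,
    "Kafka": 9092,
    "Telegraf (statsd)": 8125,
    "PySpark notebook": 8888,
}

if __name__ == "__main__":
    for name, port in services.items():
        status = "up" if port_open("localhost", port) else "down"
        print(f"{name:20s} :{port} {status}")
```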
Known issues: PySpark crashes when connecting to Kafka, and there are no dashboards.
copy .env-example to .env
cp .env-example .env
# example values:
INFLUXDB_DB=swat
INFLUXDB_HOST=http://influxdb:8086
AUTH_TOKEN=iJHZR-dq4I5LIpFZCc5bTUHx-I7dyz29ZTO-B4W5DpU4mhPVDFg-aAb2jK4Vz1C6n0DDb6ddA-bJ3EZAanAOUw==
DEFAULT_BUCKET=swat
MONITORING_BUCKET=primary
DEFAULT_ORGANIZATION=primary
ADMIN_USERNAME=admin
ADMIN_PASSWORD=admin123
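The services read these values from the environment. As a sanity check, a .env file like the one above can be parsed with plain Python (a minimal standard-library sketch; python-dotenv does the same job properly):

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # partition splits on the first '=' only, so values
        # containing '=' (like the AUTH_TOKEN above) stay intact
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

example = """\
INFLUXDB_DB=swat
INFLUXDB_HOST=http://influxdb:8086
DEFAULT_BUCKET=swat
DEFAULT_ORGANIZATION=primary
"""

config = parse_env(example)
print(config["INFLUXDB_HOST"])  # http://influxdb:8086
```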
git init
git add .
git remote remove origin
git commit -m "first commit"
git branch -M main
git remote add origin git@github.com:coding-to-music/anomalies-detection-kafka-pyspark-influxdb-grafana-telegraf-chronograf.git
git push -u origin main
This project serves as an assignment for the Open Source Technologies / Stream Mining courses.
The project task is to visualize summaries of the data stream(s) and develop an ML model to detect anomalies in the data set.
The project streams data from an xlsx file through Kafka to a topic called SWAT.
A PySpark instance in another container subscribes to this topic.
Each batch is preprocessed and stored in InfluxDB as data points together with the detected anomalies.
The data is visualized in Grafana on four dashboards, one per anomaly-detection approach.
In addition, a monitoring board is in place: Telegraf collects metrics, which are visualized in Chronograf.
Each part of the project is containerized with Docker, including two locally built images.
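The pipeline script names in the build (pipeline1_ocsvm.py through pipeline4_dbscan.py) suggest OCSVM, IsolationForest, k-means, and DBSCAN detectors. As a minimal sketch of the idea — not the repo's actual code — here is scikit-learn's IsolationForest on synthetic two-dimensional data (all values here are made up; the real pipelines train on the preprocessed SWaT stream):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for preprocessed sensor readings:
# a dense "normal" cluster plus a handful of planted outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X = np.vstack([normal, outliers])

# contamination is the expected anomaly fraction (an assumption here)
model = IsolationForest(contamination=0.05, random_state=0)
preds = model.fit_predict(X)  # 1 = normal, -1 = anomaly

print(f"{(preds == -1).sum()} of {len(X)} points flagged as anomalies")
```

In the real pipelines the predicted labels would be written to InfluxDB alongside each data point, which is what the Grafana dashboards then plot.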
The following technologies are used:
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.
InfluxDB is an open-source time series database developed by the company InfluxData. It is written in the Go programming language for storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time analytics.
Grafana is a multi-platform open source analytics and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources.
Chronograf is an open-source web application and visualization tool developed by InfluxData as part of the InfluxDB project.
Telegraf is a plugin-driven server agent for collecting & reporting metrics. Telegraf has plugins to source a variety of metrics directly from the system it’s running on, pull metrics from third-party APIs, or even listen for metrics via statsd and Kafka consumer services.
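Telegraf's statsd listener on port 8125 accepts plain-text metrics over UDP. A minimal standard-library sketch of emitting a counter from Python (the metric name is hypothetical; the port matches the service list above):

```python
import socket

def statsd_counter(name: str, value: int = 1) -> bytes:
    """Format a statsd counter metric, e.g. b'swat.batches:1|c'."""
    return f"{name}:{value}|c".encode()

def send_metric(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget UDP send; statsd sends are not acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# Hypothetical metric name, for illustration only
send_metric(statsd_counter("swat.batches_processed"))
```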
sh run.sh
Getting this output:
[>] Exporting variables
[>] Creating influxdb directory
[>] Creating grafana directory
[>] Creating chronograf directory
[>] Running docker compose
Building pyspark
Step 1/15 : FROM jupyter/pyspark-notebook
---> 458b41649487
Step 2/15 : COPY . .
---> Using cache
---> 344ec311bab2
Step 3/15 : RUN python3 -m pip install -r requirements.txt
---> Using cache
---> 8f94e6174a9e
Step 4/15 : USER root
---> Using cache
---> 633a675fff27
Step 5/15 : RUN chmod +x spark-submit.sh
---> Using cache
---> bb7e509d3909
Step 6/15 : RUN echo '[...] Preprocessing data'
---> Using cache
---> 6f0dcc79c1f5
Step 7/15 : RUN python3 preprocess.py
---> Using cache
---> 13aa0b8e2e19
Step 8/15 : RUN echo '[...] Fitting models'
---> Using cache
---> 3d24b4575edf
Step 9/15 : RUN python3 pipeline1_ocsvm.py
---> Using cache
---> d95a0222eb87
Step 10/15 : RUN python3 pipeline2_iso_log.py
---> Using cache
---> 6503f4bc6625
Step 11/15 : RUN python3 pipeline3_kmeans.py
---> Using cache
---> c58fd85b167d
Step 12/15 : RUN python3 pipeline4_dbscan.py
---> Running in e569dfc780c7
ERROR: Service 'pyspark' failed to build: The command '/bin/bash -o pipefail -c python3 pipeline4_dbscan.py' returned a non-zero code: 137
https://samwize.com/2016/05/19/docker-error-returned-a-non-zero-code-137/
19 May 2016
While running a Node app with Docker, there is an error 137:
The command '/bin/sh -c npm install' returned a non-zero code: 137
This means an out-of-memory error.
To fix it, you can add more RAM.
Or you can add more swap memory (FREE!). Swap memory uses part of your hard disk for temporary memory.
These steps are exactly the same as a previous guide:
# Confirm you have no swap
sudo swapon -s
# Allocate 1GB (or more if you wish) in /swapfile
sudo fallocate -l 1G /swapfile
# Make it secure
sudo chmod 600 /swapfile
ls -lh /swapfile
# Activate it
sudo mkswap /swapfile
sudo swapon /swapfile
# Confirm again there's indeed more memory now
free -m
sudo swapon -s
# Configure fstab to use swap when instance restart
sudo nano /etc/fstab
# Add this line to /etc/fstab, save and exit
/swapfile none swap sw 0 0
# Change swappiness to 10, so that swap is used only when free RAM drops below ~10%
# The default of 60 swaps too eagerly for this use case
echo 10 | sudo tee /proc/sys/vm/swappiness
echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf
# Optionally reclaim disk space from unused Docker resources
docker image prune -a
docker container prune
docker volume prune