/iot_spark_streaming

Topics in Big Data and IOT Assignment


Topics for Big Data and IOT - Spark Streaming

Installation

Download and install Spark 2.4.5 (the build for Hadoop 2.7), following the instructions from here.
Python 3 is required. Also install NumPy using python3 -m pip install numpy

Running the code

Data preparation and model training

The original data comes from Kaggle. Download the original dataset from there to get authentic data.
The data used here has been manually altered slightly for ease of use.

Generation of data

If you have the dataset, training data for the model can be generated with the script create_dataset.py. It collates data from multiple CPU, disk and network statistics files into a single file.
This file is used as input to train the model. A sample generated dataset is included in data/data.tgz (filename: trainingdata.csv).
Sample run of create_dataset.py:
python3 create_dataset.py data/realAWSCloudWatch data/trainingdata.csv
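
The collation step can be pictured as a join on timestamp. The following is a minimal sketch and not the actual contents of create_dataset.py: it assumes each input file is a CSV with a header row of timestamp,value, and the function name collate and the output column names are hypothetical.

```python
import csv
from collections import defaultdict


def collate(metric_files, out_path):
    """Join several metric CSVs on timestamp into one CSV.

    metric_files maps an output column name (e.g. "cpu") to a CSV path.
    Assumed input layout (hypothetical): header row "timestamp,value".
    Returns the number of data rows written.
    """
    rows = defaultdict(dict)  # timestamp -> {metric name: value}
    for name, path in metric_files.items():
        with open(path, newline="") as f:
            for rec in csv.DictReader(f):
                rows[rec["timestamp"]][name] = rec["value"]

    names = list(metric_files)
    written = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp"] + names)
        for ts in sorted(rows):
            # Keep only timestamps that appear in every input file.
            if len(rows[ts]) == len(names):
                writer.writerow([ts] + [rows[ts][n] for n in names])
                written += 1
    return written
```

Dropping timestamps that are missing from any one file keeps every training row complete, at the cost of discarding some samples. Filling gaps instead would be an equally reasonable choice.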

Training KMeans model

Extract the sample training data, or create your own dataset, and run
python3 train_model.py <data file> [num clusters]
The number of clusters defaults to 10. Note the output of the program, and set the EXPECTED_LABEL field in monitor_usage.py to the label of the most populous cluster.
The model is saved in the kmeans.trained folder.
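
Finding the most populous cluster amounts to counting how many training points were assigned to each label. The sketch below illustrates that counting step only. It is not taken from train_model.py, and the helper name most_populous_cluster and the sample labels are made up for illustration.

```python
from collections import Counter


def most_populous_cluster(labels):
    """Return (label, count) for the cluster holding the most points."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count


# Hypothetical cluster assignments for eight training rows.
labels = [3, 3, 7, 3, 1, 3, 7, 3]
expected_label, size = most_populous_cluster(labels)
print(f"EXPECTED_LABEL = {expected_label} ({size} of {len(labels)} points)")
```

The label found this way is the value that would go into EXPECTED_LABEL. Cluster labels can change between training runs, so the value should be checked again after retraining.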

Running the code

The code runs in two parts: the Spark Streaming engine and the file streamer.
Start Spark using python3 monitor_usage.py <streaming directory> [window size]. The window size is optional and defaults to 3 seconds.
The streaming directory is where Spark reads files as input to the stream processor. It must match the directory passed to the streamer program.
Run the streamer using python3 multistreamer.py <streaming directory> <files>.
The files for streaming can be found in the data.tgz archive in the data folder. A sample run is as follows:

python3 monitor_usage.py streamdata
python3 multistreamer.py streamdata data/*_anomaly.csv
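
To make the role of the streamer concrete, here is a minimal sketch of how a program might feed a directory watched by Spark. It is not the code in multistreamer.py: the function name stream_file, the chunk size, the staging directory, and the output file naming are all assumptions. Each chunk is written elsewhere first and then renamed into the streaming directory, so the watcher should never pick up a half-written file.

```python
import os
import time


def stream_file(src_path, stream_dir, lines_per_file=5, interval=1.0):
    """Drop src_path into stream_dir as a series of small files.

    Chunks are written to a staging directory next to stream_dir and then
    renamed into place. The staging directory is assumed to be on the same
    filesystem, which is what makes the rename atomic.
    Returns the list of files created, in order.
    """
    staging_dir = stream_dir.rstrip(os.sep) + ".staging"  # hypothetical location
    os.makedirs(stream_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    with open(src_path) as f:
        lines = f.readlines()

    base = os.path.basename(src_path)
    created = []
    for start in range(0, len(lines), lines_per_file):
        name = f"{base}.{start // lines_per_file:05d}"
        tmp = os.path.join(staging_dir, name)
        final = os.path.join(stream_dir, name)
        with open(tmp, "w") as out:
            out.writelines(lines[start:start + lines_per_file])
        os.replace(tmp, final)  # rename into the watched directory
        created.append(final)
        time.sleep(interval)  # pace the stream
    return created
```

For example, stream_file("data/cpu_anomaly.csv", "streamdata") would split a file of that (hypothetical) name into five-line pieces and release one per second. The sketch does not treat a header row specially, so a file with a header would need that line handled separately.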