Bosch production line performance

## Exploratory data analysis with Apache Spark RDDs, using the Python API and a Jupyter notebook

Decision_Tree_Refine.ipynb

## Running the code

The code can be run directly on a desktop with PySpark in standalone mode. A workaround that avoids setting up master and worker nodes (requires $SPARK_HOME to be set):

$ nohup python script.py &

## Feature reduction

Feature reduction (and the associated feature information) on the numeric data is performed with the high-level Spark DataFrame API (Spark SQL):

feature_correlation.py

Information on highly correlated features is constructed using the Pearson correlation criterion. The reduced feature set can then be used for training with Spark ML or MLlib.
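
The selection logic behind a Pearson criterion can be illustrated in plain Python, independent of Spark. This is only a minimal sketch under stated assumptions: the greedy keep-or-drop rule, the `threshold` value of 0.95, and the toy columns are hypothetical and are not taken from feature_correlation.py, which works on Spark DataFrames.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Constant column: the coefficient is undefined, treated as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def columns_to_drop(columns, threshold=0.95):
    """Greedily mark columns that correlate strongly with an already kept one.

    `columns` maps a column name to its list of values. `threshold` is a
    hypothetical cut-off on |r|, chosen for illustration only.
    """
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) >= threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return dropped


# Toy data (assumed): f2 is a linear function of f1, f3 is only weakly related.
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 4.1, 6.1, 8.1, 10.1],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
to_drop = columns_to_drop(toy)
```

Here `to_drop` names the columns that would be removed before training. With many features, the pairwise loop grows quadratically, which is one reason to compute the correlations in Spark rather than on a single machine.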

## Decision tree / Random Forest / GBT

Once the list of columns to be removed is known, either MLlib (RDD-based machine learning in Spark) or ML (DataFrame-based machine learning in Spark) can be invoked. The data set is split in a 0.7/0.3 ratio into training and test sets, respectively. The test set predictions are assessed based on accuracy, the confusion matrix, and the Matthews correlation coefficient.

Decision_Tree_Reduced_Feature.py
Random_forest_reduced_feature.py
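
For reference, the three evaluation measures can be computed from binary labels and predictions as in the plain-Python sketch below. The function names and the toy labels are assumptions made for illustration; the scripts above may compute these quantities differently, for example through Spark's own evaluation utilities.

```python
import math


def confusion_counts(labels, predictions):
    """Return (tp, tn, fp, fn) for binary labels and predictions in {0, 1}."""
    tp = tn = fp = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 0 and p == 0:
            tn += 1
        elif y == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that match the labels."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # A whole row or column of the confusion matrix is empty;
        # returning 0 in that case is a common convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy test-set labels and predictions (assumed, not real results).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(labels, predictions)
acc = accuracy(*counts)
mcc = matthews_corrcoef(*counts)
```

Accuracy alone can look good when one class dominates, so the Matthews correlation coefficient, which uses all four cells of the confusion matrix, is a useful complement.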

