/productionFailures

Binary classification of whether products pass or fail quality control

Primary language: Jupyter Notebook

Binary Classification with Apache Spark / HDFS

↖data source

The goal of the competition is to predict which parts will fail quality control.

My goal is to use the Hadoop ecosystem to handle a large dataset and establish a machine learning pipeline:

  • Aggregate columns using RDD transformations
  • Create a column that flags which of those aggregations are outliers
  • Model the data with Spark's machine learning package
  • Predict on test data
  • Run this as is to try the toy dataset example
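
The first two steps can be sketched in plain Python. This is only an illustration of the per-row logic, not the notebook's actual code: the function names `aggregate_row` and `flag_outliers`, the `(part_id, measurements)` row layout, the choice of the mean as the aggregate, the z-score rule, and its threshold of 2.0 are all assumptions made for the example.

```python
from statistics import mean, pstdev


def aggregate_row(row):
    """Collapse one part's measurements into a single mean value.

    `row` is a hypothetical (part_id, measurements) pair; missing
    measurements are represented as None and skipped.
    """
    part_id, measurements = row
    present = [m for m in measurements if m is not None]
    if not present:
        return (part_id, None)
    return (part_id, mean(present))


def flag_outliers(aggregated, z_threshold=2.0):
    """Append a boolean outlier flag to each (part_id, value) pair.

    A value is flagged when its z-score exceeds `z_threshold`
    (an assumed rule; the threshold is illustrative).
    """
    values = [v for _, v in aggregated if v is not None]
    if len(values) < 2:
        return [(pid, v, False) for pid, v in aggregated]
    mu = mean(values)
    sigma = pstdev(values)
    flagged = []
    for part_id, value in aggregated:
        if value is None or sigma == 0:
            is_outlier = False
        else:
            is_outlier = abs(value - mu) / sigma > z_threshold
        flagged.append((part_id, value, is_outlier))
    return flagged


# Made-up toy rows: five similar parts and one extreme part.
toy_rows = [
    ("p1", [1.0, 1.2, None]),
    ("p2", [0.9, 1.1, 1.0]),
    ("p3", [1.0, None, 1.0]),
    ("p4", [1.1, 0.9, 1.0]),
    ("p5", [1.0, 1.0, 1.0]),
    ("p6", [9.0, 11.0, 10.0]),
]

aggregated = [aggregate_row(r) for r in toy_rows]
flagged = flag_outliers(aggregated)
```

With an RDD of rows in the same shape, `aggregate_row` could be passed to `rdd.map(...)` unchanged. The mean and standard deviation used for flagging would then need to be computed across the whole RDD rather than from a local list.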