data-science

Supporting materials for a weblog series on data science

Primary language: R. License: MIT.

Usage instructions

Explanations of some of these scripts can be found on my weblog. Below is a quick guide to getting them running.

basic_feature_extraction.R

  1. Download the open bearing dataset.

  2. Either move the bearing_IMS directory to the same level as the bearing_snippets directory, or modify the first line of the script so that basedir points to the bearing_IMS/1st_test directory.

  3. Run basic_feature_extraction.R! This writes the basic feature vectors to b1.csv through b4.csv.

basic_feature_graphing.R

  1. The first time through, run basic_feature_extraction.R to generate the features. Thereafter, the features are already stored in b1.csv, b2.csv, b3.csv, and b4.csv, and you can go straight to step 2.

  2. Run basic_feature_graphing.R!
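The "extract once, reuse thereafter" pattern in step 1 can be sketched as follows. This is an illustrative Python sketch, not part of the R scripts; only the filenames match those written by basic_feature_extraction.R.

```python
import os
import tempfile

# The four feature files written by basic_feature_extraction.R.
feature_files = ["b1.csv", "b2.csv", "b3.csv", "b4.csv"]

def features_ready(directory):
    """True when all four feature files already exist in directory."""
    return all(os.path.exists(os.path.join(directory, f)) for f in feature_files)

with tempfile.TemporaryDirectory() as d:
    assert not features_ready(d)            # nothing extracted yet: run extraction first
    for f in feature_files:                 # simulate the extraction step's output
        open(os.path.join(d, f), "w").close()
    ready = features_ready(d)               # now graphing can skip straight to step 2
```

The same check applies to every later script that reuses previously written CSV files.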

more_features.R

  1. Perform steps 1 and 2 of basic_feature_extraction.R.

  2. Run more_features.R! This writes the full feature vectors to b1_all.csv through b4_all.csv.

feature_correlation.R

  1. Run more_features.R, so the features are stored in files b1_all.csv through b4_all.csv.

  2. Run feature_correlation.R, to see features with high correlation.

optimise.rb

  1. Run feature_correlation.R to output sets of features with high correlation.

  2. Run optimise.rb to select the minimal set of uncorrelated features.
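One plausible way to pick a minimal uncorrelated set, sketched in Python (optimise.rb's actual algorithm is not shown in this guide, and the feature names below are hypothetical): greedily drop the feature involved in the most high-correlation pairs until no pairs remain.

```python
# Hypothetical high-correlation pairs, as feature_correlation.R might report.
pairs = [("kurtosis", "skew"), ("skew", "rms"), ("rms", "peak")]
all_features = {"kurtosis", "skew", "rms", "peak", "entropy"}

remaining = list(pairs)
dropped = set()
while remaining:
    # Count how many unresolved correlated pairs each feature appears in.
    counts = {}
    for a, b in remaining:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    # Drop the most-correlated feature (ties broken alphabetically for determinism).
    victim = max(sorted(counts), key=counts.get)
    dropped.add(victim)
    remaining = [(a, b) for a, b in remaining if victim not in (a, b)]

selected = sorted(all_features - dropped)
```

The surviving features contain no high-correlation pair, so each one carries mostly independent information.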

feature_information.R

  1. Run more_features.R, so the features are stored in files b1_all.csv through b4_all.csv.

  2. If desired, modify line 25 of feature_information.R to include only the features you are interested in (e.g. after running optimise.rb and finding a different minimal set).

  3. Run feature_information.R to generate an interesting graph! It also writes the full feature vector plus state labels to all_bearings.csv, and the best 14 features plus state labels to all_bearings_best_fv.csv.

kmeans.R

  1. Run feature_information.R, so the minimised set of features is written to all_bearings_best_fv.csv.

  2. Run kmeans.R to select the best k-means model! The chosen model is also written to kmeans.obj.

relabel.R

  1. Run feature_information.R, so the minimised set of features is written to all_bearings_best_fv.csv.

  2. Run kmeans.R, so the best k-means model is written to kmeans.obj.

  3. Visualise the results using the graphs generated by kmeans.R. Alter the filename on line 7 to match the best k-means model. If needed, alter the cluster numbers or class labels in relabel.R to better match the data.

  4. Run relabel.R to modify the state labels. It also plots a state transition graph, and writes the new data to all_bearings_relabelled.csv.

training_set.R

  1. Requires features and labels in all_bearings_relabelled.csv, which can be generated by relabel.R.

  2. Run training_set.R to randomly pick 70% of the data rows as a training set. The row numbers are written to train.rows.csv.
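The 70% split can be sketched as follows. The row count and seed are illustrative, not taken from training_set.R; only the 70/30 proportion comes from the step above.

```python
import random

# Assume all_bearings_relabelled.csv has n_rows data rows (value illustrative).
n_rows = 100
rng = random.Random(42)  # fixed seed so the split is reproducible

# Randomly pick 70% of the row numbers as the training set, as in training_set.R.
train_rows = sorted(rng.sample(range(1, n_rows + 1), k=int(0.7 * n_rows)))
test_rows = [r for r in range(1, n_rows + 1) if r not in set(train_rows)]
```

The remaining 30% of rows then serve as the held-out test set for the classifier scripts below.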

ann_mlp.R

  1. Requires train.rows.csv and all_bearings_relabelled.csv (which can be generated by earlier scripts).

  2. Run ann_mlp.R to train and test an array of MLP ANNs with varying parameters. Parameters include:

    • Hidden neurons in the range 2 to 30 inclusive
    • Different class weightings to handle uneven counts of class labels
    • Input scaling (data normalisation, scaling to the neuron range, or neither) to handle wide disparities in feature ranges
  3. The table of results is written to ann.results.csv, all trained models are written to ann.models.obj, and the best (highest accuracy) model is written to best.ann.obj.
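The parameter sweep in steps 2 and 3 amounts to a grid search with the best model kept by accuracy. In this Python sketch, train_and_score is a dummy stand-in for actually fitting and testing an MLP; only the parameter ranges come from the list above.

```python
import itertools

hidden_sizes = range(2, 31)              # 2 to 30 hidden neurons inclusive
class_weighting = ["none", "balanced"]   # illustrative weighting schemes
scaling = ["none", "normalised", "neuron_range"]

def train_and_score(h, w, s):
    # Placeholder: a real implementation would train an MLP with these
    # parameters and return its test-set accuracy. This dummy simply
    # prefers mid-sized networks so the sketch is deterministic.
    return 1.0 - abs(h - 16) / 30

grid = list(itertools.product(hidden_sizes, class_weighting, scaling))
results = [(h, w, s, train_and_score(h, w, s)) for h, w, s in grid]
best = max(results, key=lambda row: row[3])  # highest accuracy wins, as in the R script
```

The same sweep-then-select shape recurs in rpart.R, knn.R, and svm.R, with different parameter grids.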

rpart.R

  1. Requires train.rows.csv and all_bearings_relabelled.csv (which can be generated by earlier scripts).

  2. Run rpart.R to train and test an array of RPART decision trees. Different class weightings are applied to handle uneven counts of class labels.

  3. The table of results is written to rpart.results.csv, all trained models are written to rpart.models.obj, and the best (highest accuracy) model is written to best.rpart.obj.

knn.R

  1. Requires train.rows.csv and all_bearings_relabelled.csv (which can be generated by earlier scripts).

  2. Run knn.R to train and test an array of k-nearest neighbour weighted classifiers with varying parameters. Parameters include:

    • Different kernels on the weightings (all 10 in the kknn library)
    • All k values from {1, 3, 5, 10, 15, 20, 35, 50}
  3. The table of results is written to knn.results.csv, all trained models are written to knn.models.obj, and the best (highest accuracy) model is written to best.knn.obj.

svm.R

  1. Requires train.rows.csv and all_bearings_relabelled.csv (which can be generated by earlier scripts).

  2. Run svm.R to train and test an array of Support Vector Machine classifiers with varying parameters. Parameters include:

    • Gamma from {10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}
    • Cost from {10^0, 10^1, 10^2, 10^3}
    • Different class weightings to handle uneven counts of class labels
  3. These gamma and cost values correspond to a rough grid search. A finer search should then be performed around the (gamma, cost) pair with the highest accuracy.

  4. The table of results is written to svm.results.csv, all trained models are written to svm.models.obj, and the best (highest accuracy) model is written to best.svm.obj.
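The coarse-then-fine search of step 3 can be sketched in Python. Here score is a hypothetical stand-in for training an SVM and returning its accuracy, and the half-decade refinement factors are illustrative; only the coarse gamma and cost grids come from the list above.

```python
import math

gammas = [10.0 ** e for e in range(-6, 0)]   # 1e-6 .. 1e-1
costs = [10.0 ** e for e in range(0, 4)]     # 1 .. 1000

def score(gamma, cost):
    # Placeholder accuracy surface peaking near gamma=1e-3, cost=100,
    # so the sketch has a deterministic optimum.
    return (1.0
            - (math.log10(gamma) + 3) ** 2 / 100
            - (math.log10(cost) - 2) ** 2 / 100)

# Coarse grid search over all (gamma, cost) pairs.
best_g, best_c = max(((g, c) for g in gammas for c in costs),
                     key=lambda gc: score(*gc))

# Finer search around the best coarse pair (factors are illustrative).
fine_gammas = [best_g * f for f in (0.3, 0.5, 1.0, 2.0, 3.0)]
fine_costs = [best_c * f for f in (0.3, 0.5, 1.0, 2.0, 3.0)]
best = max(((g, c) for g in fine_gammas for c in fine_costs),
           key=lambda gc: score(*gc))
```

In practice the refinement step would be repeated until accuracy stops improving.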