First, download Spark 2.4.5, then set SPARK_HOME to the directory where you unpacked it:
export SPARK_HOME=your_spark_home_dir
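Before running the examples, it can help to confirm that SPARK_HOME really points at a Spark installation. The following is a minimal sketch, not part of the project: the function name check_spark_home is hypothetical, and it only assumes the standard Spark layout where spark-submit lives under bin/.

```shell
# Hypothetical helper (not part of this project): verify that SPARK_HOME
# is set and looks like a Spark installation.
check_spark_home() {
  # Fail early if the variable is unset or empty.
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  # Assumption: a standard Spark distribution ships bin/spark-submit.
  if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
    echo "spark-submit not found under $SPARK_HOME/bin" >&2
    return 1
  fi
  echo "Using Spark at $SPARK_HOME"
}
```

Call check_spark_home after the export; a non-zero exit status means the sparkSubmit commands are likely to fail.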
./sbt "sparkSubmit \
--class com.salesforce.hw.OpTitanicSimple \
-- $PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"
./sbt "sparkSubmit \
--class com.salesforce.hw.titanic.OpTitanic -- \
--run-type=train --model-location=/tmp/titanic-model \
--read-location Passenger=$PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"
./sbt "sparkSubmit \
--class com.salesforce.hw.titanic.OpTitanic -- \
--run-type=score --model-location=/tmp/titanic-model \
--read-location Passenger=$PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv \
--write-location /tmp/titanic-scores"
./sbt "sparkSubmit \
--class com.salesforce.hw.titanic.OpTitanic -- \
--run-type evaluate \
--model-location /tmp/titanic-model \
--read-location Passenger=$PWD/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv \
--write-location /tmp/titanic-eval \
--metrics-location /tmp/titanic-metrics"
./sbt "sparkSubmit \
--class com.salesforce.hw.boston.OpBoston -- \
--run-type=train --model-location=/tmp/boston-model \
--read-location BostonHouse=$PWD/src/main/resources/BostonDataset/housing.data"
./sbt "sparkSubmit \
--class com.salesforce.hw.boston.OpBoston -- \
--run-type=score --model-location=/tmp/boston-model \
--read-location BostonHouse=$PWD/src/main/resources/BostonDataset/housing.data \
--write-location /tmp/boston-scores"
./sbt "sparkSubmit \
--class com.salesforce.hw.boston.OpBoston -- \
--run-type evaluate \
--model-location /tmp/boston-model \
--read-location BostonHouse=$PWD/src/main/resources/BostonDataset/housing.data \
--write-location /tmp/boston-eval \
--metrics-location /tmp/boston-metrics"
./sbt "sparkSubmit \
--class com.salesforce.hw.iris.OpIris -- \
--run-type=train --model-location=/tmp/iris-model \
--read-location Iris=$PWD/src/main/resources/IrisDataset/iris.data"
./sbt "sparkSubmit \
--class com.salesforce.hw.iris.OpIris -- \
--run-type=score --model-location=/tmp/iris-model \
--read-location Iris=$PWD/src/main/resources/IrisDataset/bezdekIris.data \
--write-location /tmp/iris-scores"
./sbt "sparkSubmit \
--class com.salesforce.hw.iris.OpIris -- \
--run-type evaluate \
--model-location /tmp/iris-model \
--read-location Iris=$PWD/src/main/resources/IrisDataset/bezdekIris.data \
--write-location /tmp/iris-eval \
--metrics-location /tmp/iris-metrics"
./sbt "sparkSubmit \
--class com.salesforce.hw.dataprep.JoinsAndAggregates -- \
$PWD/src/main/resources/EmailDataset/Clicks.csv \
$PWD/src/main/resources/EmailDataset/Sends.csv"
./sbt "sparkSubmit \
--class com.salesforce.hw.dataprep.ConditionalAggregation -- \
$PWD/src/main/resources/WebVisitsDataset/WebVisits.csv"
Look for the output file(s) in the location you specified. For instance, you can use avro-tools to inspect the scores files (on a Mac, run brew install avro-tools to install it).
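As a rough sketch of what that inspection might look like, the helper below prints the schema and the first few records of one scores file. The function name inspect_scores is hypothetical, and it assumes the scoring run wrote one or more files ending in .avro into the write location; adjust the pattern if your output is named differently. It uses only the getschema and tojson subcommands of avro-tools.

```shell
# Hypothetical helper (not part of this project): show the schema and the
# first few records of the first Avro file found under a scores directory.
inspect_scores() {
  dir="$1"
  # Assumption: the run wrote one or more *.avro files into this directory.
  file=$(find "$dir" -type f -name '*.avro' | sort | head -n 1)
  if [ -z "$file" ]; then
    echo "no .avro files found under $dir" >&2
    return 1
  fi
  echo "Inspecting $file"
  # Print the Avro schema of the file.
  avro-tools getschema "$file"
  # Print the first five records as JSON, one per line.
  avro-tools tojson "$file" | head -n 5
}
```

For example, inspect_scores /tmp/titanic-scores would look at the output of the Titanic scoring run shown earlier.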
Other than that, the best way to verify the results is to look through the logs generated during the run. They contain detailed information about the features, the processing, and the model's reliability.
Experiment with adding feature changes or exploring more models in any of the provided workflows.
See how high you can get your auROC!