Directories: I've split the data into the following directories: 1 .\data Contains all the data files with the last 10 instances removed (this is so we can train and then test) 2. .\full_data Contains all the data files with none of the instances removed 3. .\models Contains the generated models by WEKA. 4. .\spreadsheets Contains misc spreadsheets. 5. .\test_data Contains the data files with only the last 10 instances. 6. .\results Contains the results of training the models on .\data and then testing on .\test_data ===DATA FILES== (contained in .\data and .\full_data) File: AmericanRiverInfo_ALL.arff Info: Contains 110 instance of the water years 1901 to 2010. Contains the raw data with all the attributes. File: AmericanRiv_Train.arff Info: Same as AmericanRiverInfo_ALL.arff except it also has the water year as an attribute and it's whittled down to ~220 attributes and some of the years may be removed. Years:1901-2010 File: AmericanRiv_Forecasts3.csv/AmericanRiv_Forecasts3.arff Info: The first attribute is the forecast and determines the cutoff between real data and averaged data. For instance the instance with 1-Apr-00 as the value for FCAST_DATE would have data up to and including March but everything after March is the averaged values. The AmericanRiv_Forecasts3.arff is identical to AmericanRiv_Forecasts3.arff except it does not have the FCAST_DATE attribute. Years:1-Feb-00 through 1-May-10 File: AmericanRiv_Test1.arff Info: Same as AmericanRiver_Train.arff except it only has the last 10 year i.e. 2001-2010 Years:2001-2010 File: Forecasts.csv/Forecasts.xlsx Info: Contains the forecasts by engineers for the last ten years. //DATA FILES ==TRAINING== To train the models run .\train_models.bat there's also a .\train_models.sh which should do the same as .\train_models.bat but I haven't actually run it. To test the models run .\test_models.bat ditto on .\test_models.sh To compare the results checkout the files in .\results and the forecasts in .\spreadsheets (the forecast files are named Forecasts-Feb.xlsx etc). ==Bagging Least Med Sq Apr1 Forecast== To create the Apr1 forecast using the bagging least med sq trainging model use train_bagging_only_apr1_leastmedsq.bat The above batch script trains the data on data\AmericanRiv_TrainThruApr.arff To test the Apr1 forecast use test_bagging_only_apr1_leastmedsq.bat, it'll put the results in "results\bagging_only_apr1_leastmedsq.out". The below batch snippet compares the bagging least.med.sq forecast with the human ensemble: echo "import util_apr util_apr.compare_model_with_human('Bagging.LeastMedSq', 'bagging_only_apr1_leastmedsq.out')" | python ==//Bagging Least Med Sq Apr1 Forecast== //TRAINING ==RESULTS== I've put the comparisons of the WEKA linear regression forecasts with the human ensemble forecasts in the results\Comparisons-Feb.xlsx,..., results\Comparisons-May.xlsx. In all cases the human ensemble forecast outperformed the WEKA linear regression forecasts :-(