California-Water-Runoff-Forecasting: A Python repository from bjwbell

Directories:

I've split the data into the following directories:

1 .\data
Contains all the data files with the last 10 instances removed (this is so we can train and then test)

2. .\full_data
Contains all the data files with none of the instances removed

3. .\models
Contains the generated models by WEKA.

4. .\spreadsheets
Contains misc spreadsheets.

5. .\test_data
Contains the data files with only the last 10 instances.

6. .\results
Contains the results of training the models on .\data and then testing on .\test_data

===DATA FILES== (contained in .\data and .\full_data)

File: AmericanRiverInfo_ALL.arff
Info: Contains 110 instance of the water years
1901 to 2010. Contains the raw data with all the attributes.

File: AmericanRiv_Train.arff
Info: Same as AmericanRiverInfo_ALL.arff except
it also has the water year as an attribute and it's whittled down to ~220 attributes
and some of the years may be removed.
Years:1901-2010

File: AmericanRiv_Forecasts3.csv/AmericanRiv_Forecasts3.arff
Info: The first attribute is the forecast and determines the cutoff between real data and
averaged data.
For instance the instance with 1-Apr-00 as the value for FCAST_DATE would have data
up to and including March but everything after March is the averaged values.

The AmericanRiv_Forecasts3.arff is identical to AmericanRiv_Forecasts3.arff except
it does not have the FCAST_DATE attribute.
Years:1-Feb-00 through 1-May-10

File: AmericanRiv_Test1.arff
Info: Same as AmericanRiver_Train.arff except it only has the last 10 year
i.e. 2001-2010
Years:2001-2010

File: Forecasts.csv/Forecasts.xlsx
Info: Contains the forecasts by engineers for the last ten years.

//DATA FILES

==TRAINING==

To train the models run
.\train_models.bat
there's also a .\train_models.sh which should do the same as .\train_models.bat but I haven't actually run it.

To test the models run
.\test_models.bat
ditto on .\test_models.sh

To compare the results checkout the files in .\results and the forecasts in .\spreadsheets (the forecast files are named Forecasts-Feb.xlsx etc).

==Bagging Least Med Sq Apr1 Forecast==
To create the Apr1 forecast using the bagging least med sq trainging model use
train_bagging_only_apr1_leastmedsq.bat
The above batch script trains the data on data\AmericanRiv_TrainThruApr.arff

To test the Apr1 forecast use test_bagging_only_apr1_leastmedsq.bat, it'll put the results in "results\bagging_only_apr1_leastmedsq.out".

The below batch snippet compares the bagging least.med.sq forecast with the human ensemble:

echo "import util_apr
util_apr.compare_model_with_human('Bagging.LeastMedSq', 'bagging_only_apr1_leastmedsq.out')" | python

==//Bagging Least Med Sq Apr1 Forecast==

//TRAINING

==RESULTS==
I've put the comparisons of the WEKA linear regression forecasts with the human ensemble forecasts in the
results\Comparisons-Feb.xlsx,..., results\Comparisons-May.xlsx.
In all cases the human ensemble forecast outperformed the WEKA linear regression forecasts :-(

bjwbell/California-Water-Runoff-Forecasting