Binary classification machine learning model to predict whether it will rain tomorrow in Australia.

This is a neural network that uses binary classification to predict whether, given meteorological observations of a given day at a given weather station in Australia, it will rain there the next day. The model is trained and tested on a dataset containing about 10 years of daily weather observations from numerous Australian weather stations.

There are two separate implementations in this project: one using TensorFlow 2 and Keras, and another using scikit-learn.

The model currently has an accuracy of approximately 87%. The classes are imbalanced: because it rains on far fewer than half of all days, the target "RainTomorrow" column has many more rows with a "No" value than a "Yes" value. This means that a model which always predicts "No" would already be right about 70% of the time, so accuracy has to be judged against that baseline rather than against 50%. My goal was therefore to get the model accuracy to somewhere around 90%.
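The baseline mentioned above can be checked by counting how often the most common label occurs. The following is a minimal sketch, not code from this project; the function name `majority_baseline_accuracy` and the toy label list are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical label column: 7 "No" days and 3 "Yes" days.
labels = ["No"] * 7 + ["Yes"] * 3
print(f"Majority-class baseline: {majority_baseline_accuracy(labels):.2f}")
```

Running the same count on the real "RainTomorrow" column gives the number that the model's accuracy should be compared against.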

Here is the structure of the dataset used for training and testing, showing the header and two data rows:

Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	WindDir3pm	WindSpeed9am	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RISK_MM	RainTomorrow
2010-10-20	Sydney	12.9	20.3	0.2	3	10.9	ENE	37	W	E	11	26	70	57	1028.8	1025.6	3	1	16.9	19.8	No	0	No
2017-06-25	Brisbane	11	24.2	0	2.2	9.8	ENE	20	SSW	NNE	2	7	68	53	1020.5	1017.3	6	3	15.9	22.6	No	0	Yes

The data was sourced from this Kaggle dataset compiled by Joe Young and Adam Young, which was in turn sourced from http://www.bom.gov.au/climate/data and http://www.bom.gov.au/climate/dwo/. This data is available under a Creative Commons (CC) Attribution 3.0 licence. For details on the meaning of each observation, see this page. Copyright Commonwealth of Australia, Bureau of Meteorology.

Requirements

Python (developed with version 3.7.4).
See dependencies.txt for packages and versions (and below to install).

Data preprocessing

Data preprocessing is done by a combination of Pandas (to drop rows containing NaN values and map Yes/No strings to 1/0 binary integers), scikit-learn (to normalize numeric features by replacing each value with its z-score), and TensorFlow (to apply one-hot encoding to categorical features). The model's input layer is thus a combination of pre-normalized numeric features and one-hot encoded categorical features.
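To make these steps concrete, here is a dependency-free sketch of what each one computes. It is an illustration only: the project itself uses Pandas, scikit-learn, and TensorFlow for this, and the rows and variable names below are hypothetical. The population standard deviation (`pstdev`) is used because that is what scikit-learn's `StandardScaler` uses.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows using a few of the dataset's columns.
rows = [
    {"MinTemp": 12.9, "WindGustDir": "ENE", "RainTomorrow": "No"},
    {"MinTemp": 11.0, "WindGustDir": "ENE", "RainTomorrow": "Yes"},
    {"MinTemp": float("nan"), "WindGustDir": "W", "RainTomorrow": "No"},
    {"MinTemp": 15.1, "WindGustDir": "W", "RainTomorrow": "No"},
]


def is_missing(value):
    """Treat float NaN as a missing value."""
    return isinstance(value, float) and math.isnan(value)


# 1. Drop rows that contain any missing value.
complete = [r for r in rows if not any(is_missing(v) for v in r.values())]

# 2. Map the Yes/No label to 1/0.
labels = [1 if r["RainTomorrow"] == "Yes" else 0 for r in complete]

# 3. Replace each numeric value with its z-score: (value - mean) / std.
values = [r["MinTemp"] for r in complete]
mu, sigma = mean(values), pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

# 4. One-hot encode a categorical feature: one 0/1 slot per category.
categories = sorted({r["WindGustDir"] for r in complete})
one_hot = [[1 if r["WindGustDir"] == c else 0 for c in categories]
           for r in complete]

print(labels)      # binary targets
print(z_scores)    # normalized numeric feature
print(categories)  # order of the one-hot slots
print(one_hot)     # encoded categorical feature
```

After these steps, each row's model input would be its z-scored numeric features followed by its one-hot vectors.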

The following columns were skipped and not used as features for the model; all the rest were used:

Date: Not relevant.
RainToday: This is just a boolean representation of the numeric column "Rainfall". I experimented with adding this feature to the model, but it had no effect on accuracy.
RISK_MM: This is the amount of rain for the following day. It was used to create the label/target column "RainTomorrow", so including it as a feature would leak the answer to the model. It would be used as the target if the model were doing regression rather than classification.
RainTomorrow: Used as the training label/target.
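The column selection described above can be sketched as follows. The header is copied from the dataset sample earlier in this README; the constant names (`HEADER`, `SKIPPED`, `LABEL`) are hypothetical and are not taken from the project's code.

```python
# Column names as they appear in the dataset header.
HEADER = [
    "Date", "Location", "MinTemp", "MaxTemp", "Rainfall", "Evaporation",
    "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir9am", "WindDir3pm",
    "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm",
    "Pressure9am", "Pressure3pm", "Cloud9am", "Cloud3pm", "Temp9am",
    "Temp3pm", "RainToday", "RISK_MM", "RainTomorrow",
]

# Columns that are not fed to the model as features.
SKIPPED = {"Date", "RainToday", "RISK_MM", "RainTomorrow"}

# The column the model is trained to predict.
LABEL = "RainTomorrow"

# Everything that is not skipped becomes a model feature.
feature_columns = [name for name in HEADER if name not in SKIPPED]

print(len(feature_columns), "features:", feature_columns)
```

Keeping the skipped columns in one place makes it easy to repeat an experiment such as adding "RainToday" back in.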

The output of the model is just a single sigmoid-activation neuron which predicts target variable "RainTomorrow".
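As a rough illustration of what a sigmoid output neuron does, the sketch below squashes a raw score into a value between 0 and 1 and turns it into a Yes/No answer. The 0.5 decision threshold and the function names are assumptions made for this example; they are not taken from the project's code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_rain_tomorrow(z, threshold=0.5):
    """Turn the neuron's raw score z into a label (threshold is an assumption)."""
    return "Yes" if sigmoid(z) >= threshold else "No"


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_rain_tomorrow(score))
```

The sigmoid value can be read as the model's estimated probability of "Yes", which is why a single neuron is enough for a binary target.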

Setup

Clone the Git repository.
Install the dependencies:

pip install -r dependencies.txt

Run

python -W ignore tensor_flow.py

python -W ignore scikit_learn.py

Note that there is currently a bug in TensorFlow where deprecation warnings are printed when feature columns are used, even though the new feature column API is being used. It has been fixed, and the fix will be in a future release of TensorFlow. In the meantime, the -W ignore option in the commands above suppresses Python warnings, but some warning output may still appear.

Monitoring/logging

After training, run:

$ tensorboard --logdir logs/fit
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.0 at http://localhost:6006/ (Press CTRL+C to quit)

Then open the above URL in your browser to view the model in TensorBoard.
