The data set used for this project originates from the Australian Government's Bureau of Meteorology and can be accessed here. It includes additional columns such as 'RainToday', and the target variable, 'RainTomorrow', was obtained from Rattle.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics
df = pd.read_csv("Weather_Data.csv")
df.head()
First, we perform one-hot encoding to convert categorical variables to binary variables.
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
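To illustrate what the encoding step does, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values (not taken from Weather_Data.csv).
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "No"],
    "WindGustDir": ["W", "NNE", "W"],
    "Rainfall": [3.2, 0.0, 0.0],
})

# Each listed categorical column is replaced by one indicator column
# per category; columns that are not listed are kept unchanged.
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])

print(list(encoded.columns))
print(encoded)
```

Each indicator column is named after the original column and the category it marks, for example `RainToday_Yes`.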
Next, we replace the values in the 'RainTomorrow' column, changing them from categorical to binary (0 for 'No' and 1 for 'Yes').
df_sydney_processed.replace(['No', 'Yes'], [0, 1], inplace=True)
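The call above replaces 'No' and 'Yes' wherever they occur in the frame. If only the target column should be touched, one possible alternative is to map that column alone. The sketch below uses a toy frame with made-up values and is not the project's own code.

```python
import pandas as pd

# Toy frame with hypothetical values (not taken from Weather_Data.csv).
toy = pd.DataFrame({
    "RainTomorrow": ["No", "Yes", "No"],
    "Rainfall": [0.0, 3.2, 0.0],
})

# Map only the target column: 'No' -> 0, 'Yes' -> 1.
# Any other column is left exactly as it was.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})

print(toy)
```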
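Before splitting the data into training and testing sets, we drop the 'Date' column, cast all values to float, and separate the features from the target.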
df_sydney_processed.drop('Date', axis=1, inplace=True)
df_sydney_processed = df_sydney_processed.astype(float)
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']
We split the data into training and testing sets (80% / 20%), then fit a Linear Regression model and evaluate its performance.
x_train, x_test, Y_train, Y_test = train_test_split(features, Y, test_size=.2, random_state=10)
LinearReg = LinearRegression()
LinearReg.fit(x_train, Y_train)
predictions = LinearReg.predict(x_test)
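The evaluation of the regression predictions could look like the following sketch. It runs on synthetic data as a stand-in for the weather features (an assumption, so the printed numbers are unrelated to the project's results) and uses `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the 0/1 target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)
model = LinearRegression().fit(x_train, y_train)
predictions = model.predict(x_test)  # continuous values, not class labels

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  R2={r2:.4f}")
```

Because the target is binary, the predictions are continuous scores rather than labels, which is why regression metrics are used here instead of accuracy.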
We use the K-Nearest Neighbors classifier and assess its performance.
k = 4
KNN = KNeighborsClassifier(n_neighbors=k).fit(x_train, Y_train)
predictions2 = KNN.predict(x_test)
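A minimal sketch of how the KNN predictions might be scored with accuracy, the Jaccard index, and the F1 score. The data is synthetic (an assumption), so the values will not match the project's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the features and the 0/1 target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)
knn = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
pred = knn.predict(x_test)

knn_accuracy = accuracy_score(y_test, pred)
# For binary targets, jaccard_score and f1_score report the score of the
# positive class (label 1) by default; pass pos_label=0 for the other class.
knn_jaccard = jaccard_score(y_test, pred)
knn_f1 = f1_score(y_test, pred)
print(knn_accuracy, knn_jaccard, knn_f1)
```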
We employ the Decision Tree classifier and evaluate its performance.
Tree = DecisionTreeClassifier()
Tree = Tree.fit(x_train, Y_train)
predictions3 = Tree.predict(x_test)
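scikit-learn's decision tree permutes the features at each split, so a tree built without a fixed `random_state` can differ between runs. The sketch below, on synthetic data (an assumption), shows one way to make the fit repeatable; the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and the 0/1 target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Two trees fitted with the same seed on the same data.
tree_a = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
tree_b = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
pred_a = tree_a.predict(x_test)
pred_b = tree_b.predict(x_test)

same = bool(np.array_equal(pred_a, pred_b))  # identical predictions
tree_accuracy = accuracy_score(y_test, pred_a)
print(same, tree_accuracy)
```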
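Lastly, we use Logistic Regression and assess its performance. Note that this model is trained on a separate split (random_state=1), so its test set differs from the one used by the models above.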
x_train2, x_test2, Y_train2, Y_test2 = train_test_split(features, Y, test_size=.2, random_state=1)
LR = LogisticRegression(C=1.0, solver='liblinear').fit(x_train2, Y_train2)
predictions4 = LR.predict(x_test2)
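Besides class labels, a fitted logistic regression exposes its coefficients and class probabilities, and the probabilities can be scored with `log_loss`. The following is a sketch on synthetic data (an assumption), not the project's actual evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the 0/1 target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 2] + 0.3 * rng.normal(size=300) > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
lr = LogisticRegression(C=1.0, solver="liblinear").fit(x_train, y_train)

pred = lr.predict(x_test)          # hard labels (0 or 1)
proba = lr.predict_proba(x_test)   # one column per class, rows sum to 1
coefficients = lr.coef_            # shape (1, n_features) for a binary target

lr_accuracy = accuracy_score(y_test, pred)
lr_log_loss = log_loss(y_test, proba)
print(coefficients, lr_accuracy, lr_log_loss)
```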
We calculate various evaluation metrics for each model:
- Linear Regression:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - R-squared (R2)
- K-Nearest Neighbors:
  - Accuracy
  - Jaccard Index
  - F1 Score
- Decision Tree Classifier:
  - Accuracy
  - Jaccard Index
  - F1 Score
- Logistic Regression:
  - Coefficients
  - Predictions
  - Accuracy
  - Jaccard Index
  - F1 Score
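To make the classification metrics concrete, the sketch below derives accuracy, the Jaccard index, and the F1 score by hand from a confusion matrix and checks them against scikit-learn. The label vectors are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    jaccard_score,
)

# Hypothetical true and predicted labels (1 = rain, 0 = no rain).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
jaccard = tp / (tp + fp + fn)               # overlap of true and predicted positives
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall

print(accuracy, jaccard, f1)  # 0.8 0.6 0.75
print(accuracy_score(y_true, y_pred),
      jaccard_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

For the positive class the two scores are linked by Jaccard = F1 / (2 - F1), so the Jaccard index never exceeds the F1 score computed for the same class.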
Here are the results for each model:
- Linear Regression:
  - MAE: 0.2563
  - MSE: 0.1157
  - R2: 0.3402
- K-Nearest Neighbors:
  - Accuracy: 0.8183
  - Jaccard Index: 0.7901
  - F1 Score: 0.5966
- Decision Tree Classifier:
  - Accuracy: 0.7588
  - Jaccard Index: 0.7122
  - F1 Score: 0.5730
- Logistic Regression:
  - Coefficients: [coefficients_list]
  - Accuracy: [accuracy_score]
  - Jaccard Index: [jaccard_index]
  - F1 Score: [f1_score]
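You can use these results to compare the models on your specific problem. Note that MAE, MSE, and R2 are regression metrics computed on the Linear Regression model's continuous predictions, so they are not directly comparable to the accuracy, Jaccard index, and F1 score reported for the classifiers.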