/gmis-hackathon2020

24-hour data analytics hackathon from CAHSI GMiS conference. 3rd Place

Primary Language: Jupyter Notebook

Predict Power Generation of Solar Panels

GMiS CAHSI Data Analytics Hackathon 2020 - Team 14: Mask-araid


Question: Given the weather conditions, what DC voltage will a solar panel generate at a given time in the future?

Software Used: Jupyter Lab

Programming Language: Python

Research

First, let's do some research about solar panels. According to 1876 Energy and Trace Software, the factors that most affect solar panel output are temperature, energy conversion efficiency (power), shade, solar radiation, and location (longitude and latitude). Solar panels also work more efficiently in cold temperatures, which lets the panel produce more voltage and more electricity. Rain and snow have no effect on solar panels; however, cloudy days and humidity can slow down production.

Step I: EDA

First, we will perform some exploratory data analysis (EDA) to get a feel for the data.

import pandas as pd
data_set = pd.read_csv("cahsi_data_2020/D1.csv")
data_set.head(100)
weather_datetime solar_datetime solarRadiation uvHigh winddirAvg humidityHigh humidityLow humidityAvg qcStatus tempHigh ... windchillAvg heatindexHigh heatindexLow heatindexAvg pressureMax pressureMin pressureTrend precipRate precipTotal DC
0 2020-02-07 14:29:00 2020-02-07 14:29:1 627.70 7.0 195 24 24 24 -1 65 ... 65 65 65 65 30.06 30.05 0.60 0.0 0.0 42.036
1 2020-02-07 14:34:00 2020-02-07 14:34:1 617.31 7.0 129 24 23 23 -1 68 ... 67 68 66 67 30.06 30.05 -0.15 0.0 0.0 42.126
2 2020-02-07 14:39:00 2020-02-07 14:39:1 608.13 6.0 108 24 23 23 -1 68 ... 67 68 67 67 30.06 30.05 0.00 0.0 0.0 42.264
3 2020-02-07 14:44:00 2020-02-07 14:44:1 582.57 6.0 87 25 24 24 -1 67 ... 66 67 66 66 30.06 30.05 -0.15 0.0 0.0 42.204
4 2020-02-07 14:49:00 2020-02-07 14:49:1 571.67 6.0 38 24 24 24 -1 66 ... 66 66 66 66 30.05 30.04 -0.15 0.0 0.0 42.360
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 2020-02-07 22:24:00 2020-02-07 22:24:1 0.00 0.0 255 41 40 40 1 51 ... 51 51 51 51 30.15 30.14 0.15 0.0 0.0 0.186
96 2020-02-07 22:29:00 2020-02-07 22:29:1 0.00 0.0 3 43 41 42 1 51 ... 51 51 50 51 30.15 30.14 0.15 0.0 0.0 0.192
97 2020-02-07 22:34:00 2020-02-07 22:34:1 0.00 0.0 299 42 40 41 1 51 ... 50 51 50 50 30.15 30.14 0.00 0.0 0.0 0.192
98 2020-02-07 22:39:00 2020-02-07 22:39:1 0.00 0.0 233 42 41 41 1 51 ... 51 51 51 51 30.15 30.15 0.00 0.0 0.0 0.192
99 2020-02-07 22:44:00 2020-02-07 22:44:1 0.00 0.0 248 41 39 40 1 51 ... 51 51 51 51 30.16 30.15 0.00 0.0 0.0 0.198

100 rows × 29 columns

data_set.tail()
weather_datetime solar_datetime solarRadiation uvHigh winddirAvg humidityHigh humidityLow humidityAvg qcStatus tempHigh ... windchillAvg heatindexHigh heatindexLow heatindexAvg pressureMax pressureMin pressureTrend precipRate precipTotal DC
7955 2020-03-30 21:29:00 2020-03-30 21:29:1 0.0 0.0 153 25 25 25 1 62 ... 62 62 62 62 30.25 30.24 0.00 0.0 0.0 0.030
7956 2020-03-30 21:34:00 2020-03-30 21:34:1 0.0 0.0 160 25 25 25 1 62 ... 62 62 62 62 30.25 30.24 0.00 0.0 0.0 0.024
7957 2020-03-30 21:39:00 2020-03-30 21:39:1 0.0 0.0 188 25 25 25 1 62 ... 62 62 62 62 30.25 30.24 0.00 0.0 0.0 0.030
7958 2020-03-30 21:44:00 2020-03-30 21:44:1 0.0 0.0 153 25 25 25 1 62 ... 62 62 62 62 30.25 30.24 -0.15 0.0 0.0 0.024
7959 2020-03-30 21:49:00 2020-03-30 21:49:1 0.0 0.0 107 25 25 25 1 62 ... 62 62 62 62 30.25 30.25 0.00 0.0 0.0 0.024

5 rows × 29 columns

Observation: As it gets later in the day, solar radiation, UV, and temperature decrease. The DC voltage decreases as well.
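One way to check this observation beyond the first and last rows is to average the readings by hour of day. The sketch below uses three rows copied from the `head(100)` output as a stand-in for `data_set`; the helper name `hourly_means` is an assumption made for illustration and is not part of the notebook.

```python
import pandas as pd

# Three rows copied from the head(100) output, standing in for data_set.
sample = pd.DataFrame({
    "weather_datetime": [
        "2020-02-07 14:29:00",
        "2020-02-07 14:34:00",
        "2020-02-07 22:24:00",
    ],
    "solarRadiation": [627.70, 617.31, 0.00],
    "DC": [42.036, 42.126, 0.186],
})


def hourly_means(df, columns=("solarRadiation", "DC")):
    """Average the given columns by hour of day (0-23). Hypothetical helper."""
    hours = pd.to_datetime(df["weather_datetime"]).dt.hour
    return df[list(columns)].groupby(hours).mean()


print(hourly_means(sample))
```

Calling `hourly_means(data_set)` on the full dataset should give one row per hour, which makes the daily rise and fall of both columns easier to see than scanning raw rows.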

# what other columns are there?
data_set.columns
Index(['weather_datetime', 'solar_datetime', 'solarRadiation', 'uvHigh',
       'winddirAvg', 'humidityHigh', 'humidityLow', 'humidityAvg', 'qcStatus',
       'tempHigh', 'tempLow', 'tempAvg', 'windspeedHigh', 'windgustLow',
       'windspeedAvg', 'dewptHigh', 'dewptLow', 'dewptAvg', 'windchillHigh',
       'windchillAvg', 'heatindexHigh', 'heatindexLow', 'heatindexAvg',
       'pressureMax', 'pressureMin', 'pressureTrend', 'precipRate',
       'precipTotal', 'DC'],
      dtype='object')
# what's the size of our data?
data_set.shape
(7960, 29)
# how distributed is the data?
data_set.describe()
solarRadiation uvHigh winddirAvg humidityHigh humidityLow humidityAvg qcStatus tempHigh tempLow tempAvg ... windchillAvg heatindexHigh heatindexLow heatindexAvg pressureMax pressureMin pressureTrend precipRate precipTotal DC
count 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 ... 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000 7960.000000
mean 180.382851 1.844472 182.790075 45.861683 44.912437 45.077261 0.893970 53.745729 53.420603 53.554397 ... 53.418970 53.581533 53.236935 53.380653 30.200309 30.193024 0.000974 0.000469 0.012936 19.539283
std 264.082275 2.846040 78.432376 21.862087 21.940977 21.924786 0.311143 11.622671 11.565509 11.589908 ... 11.720226 11.353402 11.265849 11.305647 0.137469 0.137532 0.095205 0.005050 0.053583 19.753129
min 0.000000 0.000000 0.000000 11.000000 10.000000 10.000000 -1.000000 27.000000 27.000000 27.000000 ... 26.000000 27.000000 27.000000 27.000000 29.850000 29.830000 -0.600000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 139.000000 28.000000 27.000000 27.000000 1.000000 45.000000 45.000000 45.000000 ... 45.000000 45.000000 45.000000 45.000000 30.100000 30.100000 0.000000 0.000000 0.000000 0.030000
50% 0.000000 0.000000 197.000000 43.000000 42.000000 42.000000 1.000000 54.000000 54.000000 54.000000 ... 54.000000 54.000000 54.000000 54.000000 30.180000 30.180000 0.000000 0.000000 0.000000 8.379000
75% 333.120000 3.000000 215.000000 60.000000 59.000000 60.000000 1.000000 62.000000 62.000000 62.000000 ... 62.000000 62.000000 62.000000 62.000000 30.250000 30.250000 0.000000 0.000000 0.000000 39.823500
max 986.880000 10.000000 359.000000 98.000000 98.000000 98.000000 1.000000 83.000000 83.000000 83.000000 ... 83.000000 80.000000 80.000000 80.000000 30.610000 30.600000 0.600000 0.130000 0.370000 43.710000

8 rows × 27 columns

# Use DataFrame.corr to see which features correlate with DC.
# Only numeric columns are passed in; the two datetime columns are read as strings.
data_set.select_dtypes("number").corr(method="spearman")
solarRadiation uvHigh winddirAvg humidityHigh humidityLow humidityAvg qcStatus tempHigh tempLow tempAvg ... windchillAvg heatindexHigh heatindexLow heatindexAvg pressureMax pressureMin pressureTrend precipRate precipTotal DC
solarRadiation 1.000000 0.937831 -0.342999 -0.399324 -0.408921 -0.405694 -0.051987 0.452500 0.444020 0.448101 ... 0.446924 0.452235 0.442935 0.447495 0.046178 0.045571 -0.073512 0.001359 0.035785 0.814947
uvHigh 0.937831 1.000000 -0.316591 -0.407405 -0.418103 -0.414418 -0.048438 0.454691 0.445485 0.449924 ... 0.448278 0.454565 0.444371 0.449403 0.062698 0.062510 -0.096479 -0.043045 0.006510 0.700912
winddirAvg -0.342999 -0.316591 1.000000 0.319881 0.321128 0.320633 -0.058951 -0.362586 -0.361985 -0.362065 ... -0.361569 -0.362126 -0.361636 -0.361757 0.196054 0.195846 0.011043 -0.018541 0.003610 -0.259023
humidityHigh -0.399324 -0.407405 0.319881 1.000000 0.998860 0.999456 -0.132945 -0.765681 -0.763568 -0.764586 ... -0.759300 -0.765219 -0.763032 -0.764093 0.169556 0.169504 0.027204 0.197191 0.362307 -0.102328
humidityLow -0.408921 -0.418103 0.321128 0.998860 1.000000 0.999736 -0.132788 -0.767797 -0.765245 -0.766474 ... -0.761256 -0.767374 -0.764697 -0.765993 0.167956 0.167909 0.029168 0.196765 0.360056 -0.109135
humidityAvg -0.405694 -0.414418 0.320633 0.999456 0.999736 1.000000 -0.132995 -0.766938 -0.764562 -0.765708 ... -0.760461 -0.766498 -0.764012 -0.765218 0.168278 0.168241 0.028430 0.196955 0.361020 -0.106906
qcStatus -0.051987 -0.048438 -0.058951 -0.132945 -0.132788 -0.132995 1.000000 0.050799 0.052421 0.051585 ... 0.058814 0.050671 0.052195 0.051358 -0.130895 -0.131655 -0.013845 0.023923 0.051293 -0.128036
tempHigh 0.452500 0.454691 -0.362586 -0.765681 -0.767797 -0.766938 0.050799 1.000000 0.999030 0.999402 ... 0.998271 0.999708 0.998698 0.999141 -0.451769 -0.452192 -0.045211 -0.126698 -0.170193 0.176902
tempLow 0.444020 0.445485 -0.361985 -0.763568 -0.765245 -0.764562 0.052421 0.999030 1.000000 0.999544 ... 0.998397 0.998736 0.999677 0.999288 -0.455311 -0.455749 -0.043467 -0.125614 -0.170110 0.170358
tempAvg 0.448101 0.449924 -0.362065 -0.764586 -0.766474 -0.765708 0.051585 0.999402 0.999544 1.000000 ... 0.998819 0.999124 0.999221 0.999748 -0.453805 -0.454220 -0.044260 -0.126174 -0.170251 0.173594
windspeedHigh 0.397328 0.383542 -0.321357 -0.520153 -0.516107 -0.517716 -0.011902 0.500599 0.503205 0.502059 ... 0.484797 0.500218 0.502660 0.501699 -0.048984 -0.049268 -0.030898 -0.107910 -0.222856 0.242005
windgustLow 0.330530 0.317268 -0.276300 -0.407748 -0.403040 -0.404775 -0.000308 0.397769 0.401008 0.399604 ... 0.382369 0.397298 0.400612 0.399310 -0.041768 -0.041718 -0.019151 -0.086551 -0.173471 0.214411
windspeedAvg 0.387350 0.372748 -0.316086 -0.489531 -0.485025 -0.486786 -0.003781 0.472159 0.475211 0.473931 ... 0.456297 0.471763 0.474679 0.473568 -0.044306 -0.044478 -0.025686 -0.101311 -0.205448 0.245529
dewptHigh -0.052317 -0.050672 0.068162 0.567230 0.563345 0.565181 -0.143246 0.050101 0.052756 0.051704 ... 0.057759 0.050857 0.053645 0.052601 -0.235636 -0.236398 -0.020742 0.119437 0.299852 0.049395
dewptLow -0.080215 -0.080862 0.079540 0.593152 0.592413 0.593069 -0.141533 0.017321 0.020574 0.019246 ... 0.025345 0.018005 0.021514 0.020128 -0.227196 -0.227908 -0.017020 0.124784 0.308419 0.035530
dewptAvg -0.066127 -0.065882 0.074241 0.580213 0.577895 0.579206 -0.140929 0.034221 0.037206 0.036009 ... 0.042133 0.034966 0.038134 0.036922 -0.231846 -0.232595 -0.018377 0.122124 0.304470 0.041853
windchillHigh 0.452330 0.454327 -0.363098 -0.764098 -0.766250 -0.765379 0.055493 0.999509 0.998592 0.998945 ... 0.998978 0.999217 0.998260 0.998684 -0.453846 -0.454269 -0.045599 -0.126133 -0.168847 0.176803
windchillAvg 0.446924 0.448278 -0.361569 -0.759300 -0.761256 -0.760461 0.058814 0.998271 0.998397 0.998819 ... 1.000000 0.997993 0.998074 0.998567 -0.459162 -0.459578 -0.044587 -0.124342 -0.166472 0.173580
heatindexHigh 0.452235 0.454565 -0.362126 -0.765219 -0.767374 -0.766498 0.050671 0.999708 0.998736 0.999124 ... 0.997993 1.000000 0.998743 0.999269 -0.452040 -0.452460 -0.045153 -0.126702 -0.170172 0.176960
heatindexLow 0.442935 0.444371 -0.361636 -0.763032 -0.764697 -0.764012 0.052195 0.998698 0.999677 0.999221 ... 0.998074 0.998743 1.000000 0.999439 -0.455523 -0.455950 -0.043394 -0.125619 -0.170043 0.170265
heatindexAvg 0.447495 0.449403 -0.361757 -0.764093 -0.765993 -0.765218 0.051358 0.999141 0.999288 0.999748 ... 0.998567 0.999269 0.999439 1.000000 -0.454029 -0.454441 -0.044092 -0.126179 -0.170233 0.173561
pressureMax 0.046178 0.062698 0.196054 0.169556 0.167956 0.168278 -0.130895 -0.451769 -0.455311 -0.453805 ... -0.459162 -0.452040 -0.455523 -0.454029 1.000000 0.998638 -0.016431 -0.101865 -0.224125 0.081685
pressureMin 0.045571 0.062510 0.195846 0.169504 0.167909 0.168241 -0.131655 -0.452192 -0.455749 -0.454220 ... -0.459578 -0.452460 -0.455950 -0.454441 0.998638 1.000000 -0.016344 -0.103346 -0.224628 0.081169
pressureTrend -0.073512 -0.096479 0.011043 0.027204 0.029168 0.028430 -0.013845 -0.045211 -0.043467 -0.044260 ... -0.044587 -0.045153 -0.043394 -0.044092 -0.016431 -0.016344 1.000000 0.018658 0.018448 -0.027936
precipRate 0.001359 -0.043045 -0.018541 0.197191 0.196765 0.196955 0.023923 -0.126698 -0.125614 -0.126174 ... -0.124342 -0.126702 -0.125619 -0.126179 -0.101865 -0.103346 0.018658 1.000000 0.410878 0.094109
precipTotal 0.035785 0.006510 0.003610 0.362307 0.360056 0.361020 0.051293 -0.170193 -0.170110 -0.170251 ... -0.166472 -0.170172 -0.170043 -0.170233 -0.224125 -0.224628 0.018448 0.410878 1.000000 0.140794
DC 0.814947 0.700912 -0.259023 -0.102328 -0.109135 -0.106906 -0.128036 0.176902 0.170358 0.173594 ... 0.173580 0.176960 0.170265 0.173561 0.081685 0.081169 -0.027936 0.094109 0.140794 1.000000

27 rows × 27 columns

Observation: DC shows a strong correlation with:

  • solarRadiation (0.81)
  • uvHigh (0.70)

and a weak correlation with:

  • tempHigh
  • tempLow
  • tempAvg
  • windchillAvg
  • heatindexHigh
  • heatindexLow
  • heatindexAvg
  • precipTotal

Does this reflect any information gathered from our research?
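Reading one column out of a 27 × 27 matrix by eye is error-prone, so it can help to pull out just the `DC` column and sort it. This is a minimal sketch: `rank_by_correlation` is a hypothetical helper, and the toy frame contains made-up values that only show the call shape.

```python
import pandas as pd


def rank_by_correlation(df, target, method="spearman"):
    """Return each numeric column's correlation with `target`, strongest first."""
    corr = df.select_dtypes("number").corr(method=method)[target].drop(target)
    # Sort by absolute value so strong negative correlations are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "solarRadiation": [0.0, 100.0, 300.0, 600.0, 900.0],
    "humidityAvg": [60, 20, 50, 40, 30],
    "DC": [0.1, 10.0, 30.0, 41.0, 43.0],
})

print(rank_by_correlation(toy, "DC"))
```

On the real data the call would be `rank_by_correlation(data_set, "DC")`, which lists every feature rather than only the columns that fit in the truncated display.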

Step II: Feature Selection

We will split the data into features and labels and convert them into arrays to be used for our model.

import numpy as np
# we want to predict DC
labels = np.array(data_set['DC'])
# Remove the labels and unimportant features from the features list.

col = [
 'weather_datetime',
 'solar_datetime',
 'winddirAvg',
 'humidityHigh',
 'humidityLow',
 'humidityAvg',
 'heatindexLow',
 'heatindexHigh',
 'heatindexAvg',
 'qcStatus',
 'windspeedHigh',
 'windgustLow',
 'windspeedAvg',
 'dewptHigh',
 'dewptLow',
 'dewptAvg',
 'windchillHigh',
 'windchillAvg',
 'pressureMax',
 'pressureMin',
 'pressureTrend',
 'precipRate',
 'precipTotal',
 'DC']

features= data_set.drop(col, axis = 1)
feature_list = list(features.columns)
features = np.array(features)

Step III: Build and Train Model

Split the data into train and test sets.

from sklearn.model_selection import train_test_split
# The test size is kept small on purpose: a separate test set is provided, so most of this data can go to training.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.1)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (7164, 5)
Training Labels Shape: (7164,)
Testing Features Shape: (796, 5)
Testing Labels Shape: (796,)
# the features we will be using to predict DC
feature_list
['solarRadiation', 'uvHigh', 'tempHigh', 'tempLow', 'tempAvg']
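Because `train_test_split` shuffles rows at random by default, the shapes stay the same from run to run but the rows in each split do not, so scores will vary slightly. Passing `random_state` makes the split repeatable. The sketch below uses made-up arrays, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up arrays with five feature columns, like the notebook's feature set.
features = np.arange(100, dtype=float).reshape(20, 5)
labels = np.arange(20, dtype=float)


def split(seed):
    """Split with a fixed seed so the same rows are held out every time."""
    return train_test_split(features, labels, test_size=0.1, random_state=seed)


first = split(seed=42)
second = split(seed=42)

# Same seed, same held-out rows.
print(np.array_equal(first[1], second[1]))
```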

Step III.i: Hyperparameter Tuning

Hyperparameter tuning helps us find parameter values that work well for the model, which is much better than guessing. A randomized search isn't perfect, but it gives us some clues about what to try.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

Gradient Boost


gradient_boost_model = GradientBoostingRegressor()
gradient_params = {'learning_rate': sp_randFloat(),
                'subsample'    : sp_randFloat(),
                'n_estimators' : sp_randInt(200, 2000),
                'max_depth'    : sp_randInt(10, 110)
             }
random_gradient = RandomizedSearchCV(estimator= gradient_boost_model, param_distributions = gradient_params, cv = 3, verbose=2, n_iter = 100, n_jobs=-1)
random_gradient.fit(train_features, train_labels)
Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 28.2min finished





RandomizedSearchCV(cv=3, estimator=GradientBoostingRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aedd0>,
                                        'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aecd0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2ae050>,
                                        'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aee10>},
                   verbose=2)
# Results from Random Search
print(" Results from Random Search " )
print("\n The best estimator across ALL searched params:\n", random_gradient.best_estimator_)
print("\n The best score across ALL searched params:\n", random_gradient.best_score_)
print("\n The best parameters across ALL searched params:\n", random_gradient.best_params_)
print(random_gradient.score(test_features , test_labels))
 Results from Random Search 

 The best estimator across ALL searched params:
 GradientBoostingRegressor(learning_rate=0.01794706377831745, max_depth=32,
                          n_estimators=785, subsample=0.2873167459093807)

 The best score across ALL searched params:
 0.9655473309872132

 The best parameters across ALL searched params:
 {'learning_rate': 0.01794706377831745, 'max_depth': 32, 'n_estimators': 785, 'subsample': 0.2873167459093807}
0.9688789434674687

Step III.ii: Random Forest Model

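# Instantiate a random forest with 785 decision trees, reusing n_estimators
# and max_depth from the gradient boosting search above.
# criterion is left at its default (mean squared error), which older
# scikit-learn releases call "mse" and newer ones call "squared_error".
rf = RandomForestRegressor(n_estimators = 785,
                           max_depth = 32,
                           min_samples_split = 2)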
# Train the model on training data
rf.fit(train_features, train_labels)
RandomForestRegressor(max_depth=32, n_estimators=785)

Step III.iii: Accuracy - R2 Score

Let's see how well our model does on the held-out split of the training data.

y_pred = rf.predict(test_features)
from sklearn.metrics import r2_score

r2_score(test_labels, y_pred)
0.9709460517127418

Comment: Our model has an R2 score of about 0.97. That's not bad at all.
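For reference, the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares, so it measures how much of the variance in DC the model explains rather than a percentage of correct answers. A minimal sketch of the formula in plain Python, using made-up numbers:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 3.0]

print(r2(y_true, y_pred))
```

Here the total sum of squares is 5 and the residual sum of squares is 1, so the score comes out at about 0.8. A perfect prediction gives 1, and always predicting the mean gives 0.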

Step IV: Predictions Using Test Dataset

Now we will test our model using the test set. Remember that whatever we did to the training set must also be done to the testing set!

test_set = pd.read_csv("cahsi_data_2020/D2.csv")


col = [
 'weather_datetime',
 'solar_datetime',
 'winddirAvg',
 'humidityHigh',
 'humidityLow',
 'humidityAvg',
 'heatindexLow',
 'heatindexHigh',
 'heatindexAvg',
 'qcStatus',
 'windspeedHigh',
 'windgustLow',
 'windspeedAvg',
 'dewptHigh',
 'dewptLow',
 'dewptAvg',
 'windchillHigh',
 'windchillAvg',
 'pressureMax',
 'pressureMin',
 'pressureTrend',
 'precipRate',
 'precipTotal']

testset_features = test_set.drop(col, axis = 1)
testset_features = np.array(testset_features)
# Use the forest's predict method on the test data
predictions = rf.predict(testset_features)
predictions
array([1.20963236, 1.20963236, 0.70364704, ..., 6.45444127, 6.45444127,
       6.45444127])
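Dropping columns by name works as long as the test file has exactly the same columns in the same order as the training file. A more defensive option is to select the model's inputs by name using `feature_list`, which also fixes the column order. This is a sketch under that assumption; `select_model_features` is a hypothetical helper and the one-row frame is made up.

```python
import pandas as pd


def select_model_features(df, feature_list):
    """Return the model's input columns, in training order, as a NumPy array."""
    missing = [name for name in feature_list if name not in df.columns]
    if missing:
        raise KeyError(f"columns missing from input: {missing}")
    return df[feature_list].to_numpy()


feature_list = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

# Made-up frame with shuffled column order and one extra column.
frame = pd.DataFrame({
    "tempAvg": [65], "tempLow": [64], "tempHigh": [65],
    "uvHigh": [7.0], "solarRadiation": [627.7], "humidityAvg": [24],
})

print(select_model_features(frame, feature_list))
```

With this helper, the same call, `select_model_features(frame, feature_list)`, could prepare both the training data and the test data, so the two can't drift apart.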

Step V: Dump predictions into text file for later use.

print('Predictions:\n', predictions)

# Write one prediction per line; the with block closes the file automatically.
with open("answer.txt", "w") as file:
    for num in predictions:
        file.write(str(num))
        file.write("\n")
Predictions:
 [1.20963236 1.20963236 0.70364704 ... 6.45444127 6.45444127 6.45444127]
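Since the file is meant for later use, it can be worth confirming that it reads back to the same numbers. The sketch below writes the first three predictions shown above to a temporary file and reads them back; the helper names `write_predictions` and `read_predictions` are assumptions made for illustration.

```python
import os
import tempfile


def write_predictions(path, values):
    """Write one prediction per line."""
    with open(path, "w") as handle:
        for value in values:
            handle.write(f"{value}\n")


def read_predictions(path):
    """Read the file back as a list of floats, skipping blank lines."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


# First three predictions from the output above, written to a temporary file.
values = [1.20963236, 1.20963236, 0.70364704]
path = os.path.join(tempfile.mkdtemp(), "answer.txt")
write_predictions(path, values)

print(read_predictions(path) == values)
```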