Machine Learning models for predicting solar radiation using GAEMN weather data and NAM weather forecasts.
Georgia Power has installed a solar farm near the University of Georgia's campus. However, it's difficult to use the farm effectively without a reliable prediction of its expected energy output. A team of researchers at UGA has been working to develop predictive models using machine learning.
Sam Sanders et al. trained Random Forest regressors to predict observed solar radiation 1 and 24 hours into the future. They showed that prediction accuracy can be improved by incorporating a 24-hour window of radiation observations and by including a current weather forecast.
The purpose of this project is threefold:
- Replicate Sanders' results
- Extend the work temporally and verify that the results hold for every prediction hour up to 36 hours
- Make the experiments more easily replicable
Available data is stored in a MySQL database. It includes weather observations from the Georgia Automated Environmental Monitoring Network (GAEMN) and weather predictions from the North American Mesoscale Forecast System (NAM) for five point locations in Georgia for most of 2011 and some of 2012. Previously, the datasets used for training and testing were extracted by manually running SQL queries and exporting the results as CSV files. This project automates the data retrieval by dynamically constructing the SQL queries based on the desired dataset and time interval, which alleviates the need to manually modify and run queries for each experiment.
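The dynamic query construction can be sketched roughly as follows. The function, table, and column names here are illustrative placeholders, not the project's actual schema:

```python
from datetime import datetime

# Hypothetical sketch of building a dataset query from a table name, column
# list, and time interval. Real code should pass values as query parameters
# rather than interpolating them into the SQL string.
def build_query(table, columns, start, end):
    cols = ", ".join(columns)
    return (
        "SELECT {} FROM {} "
        "WHERE timestamp BETWEEN '{:%Y-%m-%d %H:%M}' AND '{:%Y-%m-%d %H:%M}' "
        "ORDER BY timestamp".format(cols, table, start, end)
    )

query = build_query(
    "gaemn_observations",                          # illustrative table name
    ["timestamp", "solar_radiation", "air_temp"],  # illustrative columns
    datetime(2011, 1, 1),
    datetime(2011, 12, 31),
)
print(query)
```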
Previously, models were trained using the WEKA software. This project ports Sanders' experiments to scikit-learn. Although implementations differ slightly between WEKA and scikit-learn, the results were successfully replicated using scikit-learn models.
Finally, Sanders' experiments were extended temporally. This project builds models for every prediction hour extending to 36 hours in the future. Results seem to hold regardless of the prediction hour.
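One way to frame the per-hour extension: for each horizon h from 1 to 36, shift the radiation series back by h rows (assuming hourly observations) so that each row's target is the value observed h hours later. This is a hedged sketch with made-up column names, not the project's actual pipeline:

```python
import numpy as np
import pandas as pd

# Toy hourly series standing in for GAEMN radiation observations.
df = pd.DataFrame({"radiation": np.arange(48, dtype=float)})

# One target column per prediction horizon; rows near the end of the series
# have no future observation and come out as NaN.
targets = {
    h: df["radiation"].shift(-h).rename("radiation_plus_{}h".format(h))
    for h in range(1, 37)
}
```

Each horizon then gets its own regressor, trained after dropping the rows whose shifted target is NaN.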
Originally, I was manually running scikit-learn models on the CSV datasets extracted from Sam's SQL queries, and we were getting results much worse than Sam's. After writing the module to extract data automatically, the results matched. It's possible I was using an old dataset or that pandas was ordering rows strangely.
In our experiments, adding a 24-hour window of past observations actually hurts prediction accuracy. This doesn't make much sense intuitively, since adding additional attributes shouldn't hurt the accuracy of a properly trained tree ensemble. It may be something odd with scikit-learn's splitting mechanics.
In our experiments, considering NAM data in isolation (that is, without GAEMN observations and without a 24-hour window) produces superior prediction accuracy over any other dataset. Sanders never trained a model using NAM-only data, so this result does not represent a deviation from expected results but rather an interesting note.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
This project uses Python to run experiments and build models.
Python installation and dependencies are managed using the Conda package manager. You'll need to install Conda to get set up.
The environment.yml file is used by Conda to create a virtual environment that includes all of the project's dependencies (including Python itself).
Navigate to the project directory and run the following command:
conda env create -f environment.yml
This will create a virtual environment named "solarrad". Activate the virtual environment with the following command:
conda activate solarrad
Make a copy of config.db.example.json and rename it to config.db.json. Modify the connection settings in config.db.json with the correct database name, location, and credentials.
- host: the network address of the database server
- port: the port on which the server is running (usually 3306)
- database: the name of the database to use
- user: username
- password: password
- raise_on_warnings: see the Official Docs
- use_pure: see the Official Docs
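A filled-in config.db.json might look like the following; every value shown here is a placeholder to be replaced with your own settings:

```json
{
  "host": "localhost",
  "port": 3306,
  "database": "solarrad",
  "user": "my_user",
  "password": "my_password",
  "raise_on_warnings": true,
  "use_pure": false
}
```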
Experiments are stored in the experiments subdirectory. Currently, the only working experiment is rf.py. After the environment has been activated, the experiments can be run as follows:
$ python experiments/rf.py
- Python 3.6
- Conda - Package Manager
1/22/18: Established goal of replicating Sam's work and extending it temporally.
2/6/18: Noticed weird behavior when running custom SQL queries for different hours. We were generating several identical rows but with different target values.
2/13/18: Discovered that the previous problem was caused by failing to enforce matching of the site_id when joining rows. Easy fix.
2/20/18: Was able to replicate Sam's results for 1, 6, 12, 18, and 24 hour predictions. However, it takes a really long time to generate these datasets by hand and run experiments using WEKA. Updated goal: port everything to Python using the usual Python data science tools.
2/27/18: Noticed that scikit-learn's RF regressor was giving much worse results than WEKA's. One possible explanation is the lack of post-pruning in the scikit-learn implementation. New goal: try several different ML packages and see if we can establish whether the WEKA or scikit-learn results are typical.
3/3/18: Ran a few experiments with the H2O framework, and the results mostly agreed with WEKA's. Ported dataset generation to Python using mysql-connector-python. Tried to ensure the sklearn RF hyperparameters were the same as WEKA's. Resolved the problem with long runtimes by switching from the mae splitting criterion to mse.
3/4/18: Ran into a new problem: MySQL queries with complicated joins were timing out for the 3x3 grid multi-cell queries. The problem was traced to an inner join that uses the DATE_SUB function. Replaced the DATE_SUB call with a query that relies on the ordering of the rows. This makes the query less readable and less robust, but it was the only solution I could come up with quickly.
With the new data loader, the result discrepancy is cleared up. Unsure why. Perhaps my CSV files had been generated incorrectly.
Switched to a train-test split for evaluating results instead of 10-fold CV. This makes experiments run much faster with no discernible difference in results.
This project is licensed under the MIT License - see the LICENSE.md file for details.