/pvtools

Tools for scraping and analyzing data from my photovoltaic power system

Primary LanguagePythonMIT LicenseMIT

pvtools

Tools for scraping and analyzing data from my photovoltaic power system

Keywords: Python 3, lxml, psycopg2, PostgreSQL, ETL

Introduction

The controller for my photovoltaic system hosts a webserver that provides system status including the current production of the system. It does not provide long-term logging or data analysis.

Several years ago, I created a cron job to sample the state of the system every five minutes. The script to do this is simple. It uses wget to download the files.

wget \
    --append-output=solar.log \
    --no-cookies \
    --directory-prefix=$(date -Iminute) \
    --adjust-extension \
    --convert-links \
    --page-requisites \
    --execute robots=off \
    --no-parent \
    http://192.168.1.215/home \
    http://192.168.1.215/production

At the time, I needed get something running quickly. Solar City sent us a notice that we would lose our nice 3rd party monitoring by Enphase Energy if we didn't pay them a bunch of money to install their new system. The notice eventually turned out to be false.

The wget approach is successful in that it does capture all the data and will continue to do so even if the output format changes (provided the URLs remain unchanged). However, the data is uncompressed, contains redundant files, and requires additional steps to extract the useful information.

The goals of this project are to extract the data into a more useful and more compact form while demonstrating what's possible in a few lines of Python. I've tried to strike a compromise between a quick-but-brittle hack and a general system that might handle many more use cases than I have.

At a minimum, monitoring and logging should report the following:

  • System outages
  • Production on the peak day of the year
  • Total production for the year

We monitor system outages via an indicator LED. Now that we know the remote monitoring isn't going away, relying on an email from Enphase if we don't happen to notice the LED is fine for now.

Due to weather and pollution, PV output varies from year to year, but peak production and annual production tell us how the system is degrading over time.

Here's a brittle hack I've used grep to pluck out the current production:

wget -qO- http://192.168.1.215/home|grep -o "generating.*"|grep -o "[0-9.][0-9.]*"

This pipes home.html to grep matching all lines containing the word 'generating' and outputing line from the word 'generating' on. The second grep extracts the longest sequence of one or more digits. What could go wrong? Well, the units of the number change depending its magnitude, for one thing. It works for setting an indicator light when the panels are generating non-zero power. But more importantly, this expression doesn't offer much in the way of error detection and this style is too cryptic for a team effort. Which brings us to Python.

System Overview

The main code is pvtodb.py.

    python3 pvtodb.py dir1 [dir2 ...]

pvtodb does three things:

  • Looks for specific HTML file names
  • Parses the HTML
  • Extracts selected table values
  • Writes the values to a PostgreSQL database

External Libraries

The libraries available make short work of this and give much improved functionality.

lxml provides a Python interface to libxml2 and xslt. lxml parses the HTML for us and allows us to use high-level patterns on the structure of the HTML to extract the data.

psycopg2 provides a Python interface to PostgreSQL. (In case you're wondering why I didn't use a spreadsheet, I've got over 150k data points to analyze and it was time to refresh my database skills.)

dateutil provides a way to detect badly formatted dates. This is most likely to happen if pytodb is given an invalid directory, but that's easy to do.

I explored using one of the many libraries that offers unit checking for numerical values such as pint, but decided it would be more work to find the best one than to write a seven-line function.

Handy SQL Queries

Show the production for each week:

SELECT *
FROM
  (SELECT extract(week FROM time) AS week,
          extract(YEAR FROM time) AS year,
	  max(week_wh) AS max_wh
   FROM production GROUP BY week, year) a
ORDER BY year, week;

Show the peak production values per year:

SELECT *
FROM
  (SELECT extract(year FROM time) AS year,
          max(power_w) AS power_w
   FROM production GROUP BY year) a
ORDER BY year;