canadaHCDx: An expansion to provide additional functions for the canadaHCD
package
Environment and Climate Change Canada (ECCC) provides archival climate data for a wealth of climate stations across the country. This data is available for download on the ECCC data portal. This website is useful for browsing data, and is convenient for downloading short periods of data. However, if you are seeking a 30-year baseline, the manual download of the necessary data can become very tedious. In that regard, ECCC also provides a bulk download function to make acquiring that data a little bit easier. While the process is much simpler than it has been in the past, one still must put in a little leg work to:
- Identify the station of interest
- Find the correct data set
- Download the relevant data files
- Process the files into workable data sets
To overcome these limitations, Gavin L. Simpson has written a package for R, which performs the above processes with minimal user input. The canadaHCD
package is a very useful tool, however, I noticed that there were a couple of features missing that I needed for my personal workflow. This repository provides a small expansion for the canadaHCD
package. It adds additional search options. In this README, I will strive to outline the basic functionality of both the original package and my expansion.
Most R packages are installed from the Comprehensive R Archive Network (CRAN). canadaHCD
, however, remains under development and is not hosted on the CRAN. Its source code is freely available on GitHub, so we can harness the wonderful remotes
package to install it. You may also require the "git2r
" package to access the git repository.
For you convenience, I have written a quick script that will install remotes
, git2r
, canadaHCD
, canadaHCDx
, and all dependencies. If you would rather install these components individually, take a look at the script to see which commands to run.
source("https://gitlab.com/ConorIA/canadaHCDx/raw/master/install_canadaHCDx.R")
Note that the above script can also be used to keep remotes
, canadaHCD
and canadaHCDx
up to date.
Once the packages have installed, load canadaHCDx
by:
library(canadaHCDx) #canadaHCDx will load canadaHCD automatically
Simpson (2016) has already outlined the basic functions of this package in great detail on the README for his GitHub repository. However, I will include a basic overview of the functions in this README.
To search for a station by name, use the find_station()
function. For instance, to search for a station with the word 'Toronto' in the station name, use the following code:
find_station("Toronto")
## # A tibble: 93 x 5
## Name Province StationID LatitudeDD LongitudeDD
## <chr> <chr> <int> <dbl> <dbl>
## 1 PA TORONTO INTERNATIONAL TRA… ONTARIO 52731 44.2 -79.7
## 2 PA TORONTO NORTH YORK MOTORS ONTARIO 52678 43.7 -79.5
## 3 PA SCARBOROUGH TORONTO HUNT ONTARIO 52641 43.7 -79.3
## 4 PA TORONTO HYUNDAI ONTARIO 52640 43.7 -79.4
## 5 TORONTO ONTARIO 5051 43.7 -79.4
## 6 TORONTO CITY ONTARIO 31688 43.7 -79.4
## 7 TORONTO CITY CENTRE ONTARIO 48549 43.6 -79.4
## 8 TORONTO AGINCOURT ONTARIO 5052 43.8 -79.3
## 9 TORONTO ASHBRIDGES BAY ONTARIO 5053 43.7 -79.3
## 10 TORONTO BALMY BEACH ONTARIO 5054 43.7 -79.3
## # ... with 83 more rows
Note that the tibble
object (a special sort of data.frame
) won't print more than the first 10 rows by default. To see all of the results, you can wrap the command in View()
so that it becomes View(find_station("Toronto"))
.
Note that you can also use wildcards as supported by the glob2rx()
from the utils
package by passing the argument glob = TRUE
, as in the following example.
find_station("Tor*", glob = TRUE)
## # A tibble: 94 x 5
## Name Province StationID LatitudeDD LongitudeDD
## <chr> <chr> <int> <dbl> <dbl>
## 1 TORRENS LO ALBERTA 2440 54.3 -120.
## 2 TORQUAY SASKATCHEWAN 3034 49.1 -104.
## 3 TORONTO ONTARIO 5051 43.7 -79.4
## 4 TORONTO CITY ONTARIO 31688 43.7 -79.4
## 5 TORONTO CITY CENTRE ONTARIO 48549 43.6 -79.4
## 6 TORONTO AGINCOURT ONTARIO 5052 43.8 -79.3
## 7 TORONTO ASHBRIDGES BAY ONTARIO 5053 43.7 -79.3
## 8 TORONTO BALMY BEACH ONTARIO 5054 43.7 -79.3
## 9 TORONTO BEACON ROAD ONTARIO 5055 43.8 -79.3
## 10 TORONTO BERMONDSEY ONTARIO 5056 43.7 -79.3
## # ... with 84 more rows
Once you have found your station of interest, you can download hourly, daily and monthly data using the hcd_hourly()
, hcd_daily()
, and hcd_monthly()
functions, respectively. Look at the documentation for each function by typing a question mark, followed by the function name, or using the help()
function. For instance, to find out what arguments the hcd_daily()
function takes, type ?hcd_daily
or help(hcd_daily)
.
If I wanted to download daily data for Toronto City (station no. 5051) from 1971 to 2000, I could use:
city <- hcd_daily(5051, 1971:2000)
city
## # A tibble: 10,958 x 13
## Station Date MaxTemp MinTemp MeanTemp HeatDegDays CoolDegDays
## <int> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5051 1971-01-01 -0.6 -10 -5.3 23.3 0
## 2 5051 1971-01-02 3.3 -8.9 -2.8 20.8 0
## 3 5051 1971-01-03 2.2 0 1.1 16.9 0
## 4 5051 1971-01-04 6.7 -0.6 3.1 14.9 0
## 5 5051 1971-01-05 -0.6 -3.3 -2 20 0
## 6 5051 1971-01-06 -4.4 -8.9 -6.7 24.7 0
## 7 5051 1971-01-07 -5.6 -11.7 -8.7 26.7 0
## 8 5051 1971-01-08 -6.1 -13.9 -10 28 0
## 9 5051 1971-01-09 -1.1 -8.3 -4.7 22.7 0
## 10 5051 1971-01-10 0.6 -5.6 -2.5 20.5 0
## # ... with 10,948 more rows, and 6 more variables: TotalRain <dbl>,
## # TotalSnow <dbl>, TotalPrecip <dbl>, GroundSnow <int>,
## # MaxGustDir <int>, MaxGustSpeed <chr>
Make sure to use the assignment operator (<-
) to save the data into an R object, otherwise the data will just print out to the console, and won't get saved anywhere in the memory.
Users of the canadaHCD
package can largely avoid using the Environment and Climate Change Canada website to acquire climate data, however, there were a few search features on the ECCC website that weren't available in the installed package.
This package will download the official Environment and Climate Change Canada station inventory table. This means that changes to station codes or the like won't leave you scratching your head. The inventory in the main canadaHCD
package, on the other hand, hasn't been updated in some time. To cache the latest data manually, run get_station_data()
.
This packages masks find_station()
from canadaHCD
with a version that includes the ability to search for stations by name (as per the original), but adds filters to search by province, by a given baseline period, and by proximity to another station or a vector of coordinates. You can use any combination of these four filters in your search. There are a few mandatory arguments for each filter. For instance, if you are searching for a certain baseline period, you must also include the type of data you are looking for (hourly, daily, or monthly; defaults to daily). The function is fully documented, so take a look at ?find_station
. Let's look at some examples.
find_station(province = "ON")
## # A tibble: 1,626 x 5
## Name Province StationID LatitudeDD LongitudeDD
## <chr> <chr> <int> <dbl> <dbl>
## 1 ATTAWAPISKAT ONTARIO 3913 52.9 -82.4
## 2 ATTAWAPISKAT A ONTARIO 52918 52.9 -82.4
## 3 BIG TROUT LAKE A ONTARIO 49488 53.8 -89.9
## 4 BIG TROUT LAKE ONTARIO 3914 53.8 -89.9
## 5 BIG TROUT LAKE READAC ONTARIO 10804 53.8 -89.9
## 6 BIG TROUT LAKE ONTARIO 50857 53.8 -89.9
## 7 CENTRAL PATRICIA ONTARIO 3915 51.5 -90.2
## 8 DONA LAKE ONTARIO 3916 51.4 -90.1
## 9 EAR FALLS ONTARIO 3917 50.6 -93.2
## 10 EAR FALLS (AUT) ONTARIO 27865 50.6 -93.2
## # ... with 1,616 more rows
Note: You can use a vector of more than on province, e.g. province = c("ON", "QC")
.
find_station("Toronto", baseline = 1971:2000, type = "hourly")
## # A tibble: 2 x 7
## Name Province StationID LatitudeDD LongitudeDD HourlyFirstYr
## <chr> <chr> <int> <dbl> <dbl> <int>
## 1 TORONTO ISLAND A ONTARIO 5085 43.6 -79.4 1957
## 2 TORONTO LESTER … ONTARIO 5097 43.7 -79.6 1953
## # ... with 1 more variable: HourlyLastYr <int>
find_station(target = 5051, dist = 0:100)
## # A tibble: 506 x 6
## Name Province StationID LatitudeDD LongitudeDD Dist
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 TORONTO ONTARIO 5051 43.7 -79.4 0
## 2 TORONTO CITY ONTARIO 31688 43.7 -79.4 0
## 3 TORONTO DEER PARK ONTARIO 5065 43.7 -79.4 1.96
## 4 PA MATTAMY ATHLETIC CE… ONTARIO 52723 43.7 -79.4 1.96
## 5 TORONTO SHERBOURNE ONTARIO 5089 43.6 -79.4 3.29
## 6 PA DUFFERIN AND ST. CL… ONTARIO 53838 43.7 -79.4 3.41
## 7 TORONTO BROADVIEW ONTARIO 5063 43.7 -79.4 4.03
## 8 TORONTO LEASIDE S ONTARIO 5095 43.7 -79.4 4.12
## 9 TORONTO NORTHCLIFFE ONTARIO 5108 43.7 -79.4 4.18
## 10 TORONTO CITY CENTRE ONTARIO 48549 43.6 -79.4 4.44
## # ... with 496 more rows
find_station(target = c(43.7860, -79.1873), dist = 0:5)
## # A tibble: 8 x 6
## Name Province StationID LatitudeDD LongitudeDD Dist
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 PA U OF T SCARBOROUGH T… ONTARIO 52724 43.8 -79.2 0.701
## 2 TOR SCARBOROUGH COLLEGE ONTARIO 5091 43.8 -79.2 0.889
## 3 TORONTO HIGHLAND CREEK ONTARIO 5080 43.8 -79.2 1.54
## 4 TORONTO WEST HILL ONTARIO 5120 43.8 -79.2 2.26
## 5 TORONTO MALVERN ONTARIO 5099 43.8 -79.2 3.77
## 6 TORONTO METRO ZOO ONTARIO 5102 43.8 -79.2 3.82
## 7 TORONTO CURRAN HALL ONTARIO 5064 43.8 -79.2 4.13
## 8 ROUGE HILLS ONTARIO 5024 43.8 -79.1 4.87
There have been a number of cases where the same station has changed name and ID over its history. In this case, filtering by baseline might exclude these stations. If you would like the find_station_adv()
function to try to identify these cases, pass duplicates = TRUE
to the function. The function will report any combination for which the coordinates are the same, and which, together, provide sufficient baseline data.
find_station(baseline = 1981:2010, target = 5051, dist = 0:10, duplicates = TRUE)
## Note: In addition to the stations found, the following combinations may provide sufficient baseline data.
##
## ## Combination 1 at coordinates 43.63 -79.4
##
## 5085: TORONTO ISLAND A
## 5086: TORONTO IS A (AUT)
## 30247: TORONTO CITY CENTRE
## 48549: TORONTO CITY CENTRE
## # A tibble: 1 x 8
## Name Province StationID LatitudeDD LongitudeDD DailyFirstYr DailyLastYr
## <chr> <chr> <int> <dbl> <dbl> <int> <int>
## 1 TORO… ONTARIO 5051 43.7 -79.4 1840 2017
## # ... with 1 more variable: Dist <dbl>
Sometimes a long list of stations is hard to visualize spatially. To overcome this, I added a third function, which takes a list of stations and shows them on a map powered by the Leaflet library. The map function is even smart enough to take a search as its list of stations as per the example below. Unfortuntely, this does not render properly on GitLab.
map_stations(find_station(target = c(43.7860, -79.1873), dist = 10:20), zoom = 6)
Note: use type = "sentinel"
for cloudless satellite imagery from EOX IT Services GmbH.
The quick_audit()
function will return a tibble listing the percentage or number of missing values for a station. For now, it has only been tested with daily data. The following code will return the percentage of missing values at Yellowknife Hydro:
yh <- hcd_daily(1707, 1981:2010)
quick_audit(yh, c("MaxTemp", "MinTemp", "MeanTemp"))
## # A tibble: 30 x 4
## Year `MaxTemp pct NA` `MinTemp pct NA` `MeanTemp pct NA`
## <int> <dbl> <dbl> <dbl>
## 1 1981 8.22 8.22 8.22
## 2 1982 8.49 8.49 8.49
## 3 1983 0 0 0
## 4 1984 0 0 0
## 5 1985 33.7 33.7 33.7
## 6 1986 33.2 33.2 33.2
## 7 1987 8.49 8.49 8.49
## 8 1988 50.3 50.3 50.3
## 9 1989 100 100 100
## 10 1990 8.22 8.22 8.22
## # ... with 20 more rows
Use report = "n"
to show the number of missing values. Use by = "month"
to show missing data by month instead of year. To show the number of missing values at Toronto Pearson Airport in 1993:
pears <- hcd_daily(5097, 1993)
quick_audit(pears, "MeanTemp", by = "month", report = "n")
## # A tibble: 12 x 5
## Year Month `Year-month` `MeanTemp consec NA` `MeanTemp tot NA`
## <chr> <chr> <S3: yearmon> <int> <int>
## 1 1993 01 Jan 1993 1 2
## 2 1993 02 Feb 1993 1 2
## 3 1993 03 Mar 1993 1 4
## 4 1993 04 Apr 1993 2 7
## 5 1993 05 May 1993 1 2
## 6 1993 06 Jun 1993 1 2
## 7 1993 07 Jul 1993 1 1
## 8 1993 08 Aug 1993 1 1
## 9 1993 09 Sep 1993 4 10
## 10 1993 10 Oct 1993 2 10
## 11 1993 11 Nov 1993 3 9
## 12 1993 12 Dec 1993 2 5
The package outlined in this document is published under the GNU General Public License, version 3 (GPL-3.0). The GPL is an open source, copyleft license that allows for the modification and redistribution of original works. This license is what makes it perfectly legal for me to build upon Simpson's work and modify it as I see fit. It is important to note, however, that programs licensed under the GPL come with NO WARRANTY. In our case, a simple R package isn't likely to blow up your computer or kill your cat. Nonetheless, it is always a good idea to pay attention to what you are doing, to ensure that you have downloaded the correct data, and that everything looks ship-shape.
If you run into an issue while you are using the package, you can email me and I can help you troubleshoot the issue. However, if the issue is related to the package code and not your own fault, you should contribute back to the open source community by reporting the issue. If the issue is in one of the basic canadaHCD
functions, report the issue on GitHub. If your issue is with one of my extended functions, report the issue to me here on GitLab.
If that seems like a lot of work, just think about how much work it would have been to do all the work this package does for you, or how much time went in to writing these functions!