butteR can be used to smooth out the analysis and visualization of spatial survey data collected using mobile data collection systems (ODK/XLSForm). butteR mainly consists of convenient wrappers and pipelines for the survey, srvyr, sf, and rtree packages.
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("zackarno/butteR")
## Example
The stratified sampler function is useful if you want to generate random samples from spatial point data. It has been most useful for me when I have shelter footprint data that I want to sample. For now, the function only reads in point data. Therefore, if your footprint data are polygons, they should first be converted to points (centroids).
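The polygon-to-centroid step mentioned above can be sketched with sf. This is only an illustration: the two square footprints, the `square()` helper, and the `shelter_id` column are made up, and the result is flattened to a plain data frame with `lon`, `lat`, and `strata` columns like the simulated point data used in this README.

```r
library(sf)

# Hypothetical helper: a small square "footprint" with its lower-left corner at (x, y)
square <- function(x, y, size = 0.001) {
  sf::st_polygon(list(rbind(
    c(x, y), c(x + size, y), c(x + size, y + size), c(x, y + size), c(x, y)
  )))
}

# Two made-up shelter footprints in WGS84 (EPSG:4326)
footprints <- sf::st_sf(
  shelter_id = c("s1", "s2"),
  strata = c("A", "B"),
  geometry = sf::st_sfc(square(90.1, 23.5), square(90.2, 23.6), crs = 4326)
)

# Replace each polygon with its centroid
# (sf warns about attribute constancy and, in older versions, about lon/lat data)
centroids <- suppressWarnings(sf::st_centroid(footprints))

# Flatten to a plain data frame of point coordinates plus the strata label
xy <- sf::st_coordinates(centroids)
pt_data_from_polygons <- data.frame(
  lon = xy[, "X"],
  lat = xy[, "Y"],
  strata = centroids$strata
)
```

For footprints this small, the centroid of a lon/lat polygon is accurate enough for sampling purposes; for large polygons it may be safer to project the data first.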
I believe the most useful and powerful aspect of this function is its ability to write out well-labelled kml/kmz files that can be loaded onto a phone and opened with maps.me or other applications. To use this function properly, it is important that you first familiarize yourself with some of the theory that underlies random sampling and learn how “seeds” can be set in R to make random sampling reproducible. The function generates a random seed and stores it as an attribute field of the spatial sample. There is also the option to write the seed to the working directory as a text file. Understanding how to use the seed becomes important if you want to reproduce your results, or if you need to do subsequent rounds of sampling where you want to exclude the previous sample without having to read in the previous samples.
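The role of the seed can be illustrated in base R. This is a minimal sketch of the general idea, not of how `stratified_sampler` works internally: the `ids` vector, the sample sizes, and the choice of `seed + 1` for the second round are all hypothetical.

```r
# Hypothetical sampling frame: 20 point ids
ids <- 1:20

# Round 1: set a known seed, then sample
seed <- 828005  # stands in for a stored seed value
set.seed(seed)
round1 <- sample(ids, size = 5)

# Later: re-create round 1 from the seed alone, without reading the old sample back in
set.seed(seed)
round1_again <- sample(ids, size = 5)
identical(round1, round1_again)

# Round 2: sample only from the points that were not drawn in round 1
remaining <- setdiff(ids, round1_again)
set.seed(seed + 1)  # hypothetical choice of a new seed for the new round
round2 <- sample(remaining, size = 5)
```

One caveat: R 3.6.0 changed the default algorithm behind `sample()`, so the same seed can give a different sample on older R versions. It is worth recording the R version alongside the seed.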
To show how the function can be used, I will first simulate a spatial data set and a sample frame:
library(butteR)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.1
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
lon<-runif(min=88.00863,max=92.68031, n=1000)
lat<-runif(min=20.59061,max=26.63451, n=1000)
strata_options<-LETTERS[1:8]
#simulate datasets
pt_data<-data.frame(lon=lon, lat=lat, strata=sample(strata_options,1000, replace=TRUE))
sample_frame<-data.frame(strata=strata_options,sample_size=round(runif(n=8,min=10,max=100),0))
Here are the first six rows of the data set and of the sample frame:
pt_data %>% head() %>% knitr::kable()
lon | lat | strata |
---|---|---|
90.14262 | 26.06148 | D |
91.21273 | 23.59155 | C |
90.19238 | 26.24277 | E |
90.02332 | 25.27046 | H |
89.53342 | 20.90264 | G |
88.85128 | 20.98232 | G |
sample_frame %>% head() %>% knitr::kable()
strata | sample_size |
---|---|
A | 33 |
B | 69 |
C | 39 |
D | 85 |
E | 30 |
F | 16 |
Next we will run the stratified_sampler function using the two simulated data sets as input.
You can check the function help file by typing ?stratified_sampler. There are quite a few parameters to set, particularly if you want to write out the kml file. Therefore, it is important to read the function's documentation (it will be worth it).
sampler_ouput<-butteR::stratified_sampler(
sample.target.frame = sample_frame,
sample.target.frame.strata = "strata",
sample.target.frame.samp.size = "sample_size",
pt.data = pt_data,
pt.data.strata = "strata",
pt.data.labels = "strata",
write_kml = FALSE
)
The output is stored in a list. Below are the first six rows of each stratified sample. The results can be viewed collectively or one stratum at a time.
sampler_ouput$results %>% purrr::map(head) %>% knitr::kable()
Description | rnd_seed | uuid |
---|---|---|
1_A | 828005 | 27 |
2_A | 828005 | 68 |
3_A | 828005 | 83 |
4_A | 828005 | 100 |
5_A | 828005 | 101 |
6_A | 828005 | 124 |
Description | rnd_seed | uuid |
---|---|---|
1_B | 828005 | 10 |
2_B | 828005 | 41 |
3_B | 828005 | 44 |
4_B | 828005 | 62 |
5_B | 828005 | 69 |
6_B | 828005 | 92 |
Description | rnd_seed | uuid |
---|---|---|
1_C | 828005 | 2 |
2_C | 828005 | 32 |
3_C | 828005 | 36 |
4_C | 828005 | 45 |
5_C | 828005 | 110 |
6_C | 828005 | 138 |
Description | rnd_seed | uuid |
---|---|---|
1_D | 828005 | 1 |
2_D | 828005 | 12 |
3_D | 828005 | 13 |
4_D | 828005 | 17 |
5_D | 828005 | 28 |
6_D | 828005 | 51 |
Description | rnd_seed | uuid |
---|---|---|
1_E | 828005 | 33 |
2_E | 828005 | 50 |
3_E | 828005 | 66 |
4_E | 828005 | 87 |
5_E | 828005 | 109 |
6_E | 828005 | 146 |
Description | rnd_seed | uuid |
---|---|---|
1_F | 828005 | 135 |
2_F | 828005 | 153 |
3_F | 828005 | 317 |
4_F | 828005 | 381 |
5_F | 828005 | 402 |
6_F | 828005 | 462 |
Description | rnd_seed | uuid |
---|---|---|
1_G | 828005 | 5 |
2_G | 828005 | 6 |
3_G | 828005 | 14 |
4_G | 828005 | 19 |
5_G | 828005 | 20 |
6_G | 828005 | 25 |
Description | rnd_seed | uuid |
---|---|---|
1_H | 828005 | 23 |
2_H | 828005 | 24 |
3_H | 828005 | 30 |
4_H | 828005 | 49 |
5_H | 828005 | 75 |
6_H | 828005 | 85 |
sampler_ouput$results$D %>% head()
#> Description rnd_seed uuid
#> 1 1_D 828005 1
#> 2 2_D 828005 12
#> 3 3_D 828005 13
#> 4 4_D 828005 17
#> 5 5_D 828005 28
#> 6 6_D 828005 51
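If a single table is more convenient than a list, the per-stratum results can be stacked. The sketch below uses a small mock list with the same columns as the output above (a few values copied from the A and D tables) rather than the real sampler output, and assumes dplyr is installed.

```r
# Mock of the structure shown above: a named list with one data frame per stratum
results <- list(
  A = data.frame(Description = c("1_A", "2_A"), rnd_seed = 828005, uuid = c(27, 68),
                 stringsAsFactors = FALSE),
  D = data.frame(Description = c("1_D", "2_D"), rnd_seed = 828005, uuid = c(1, 12),
                 stringsAsFactors = FALSE)
)

# Stack the list into one data frame; .id stores each list name in a "strata" column
all_samples <- dplyr::bind_rows(results, .id = "strata")
```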
The random_seed is saved in the list and also as an attribute of each stratified sample. The random seed is very important for reproducibility, which is quite useful for subsequent rounds of data collection.
sampler_ouput$random_seed
#> [1] 828005
You can also view all of the remaining points that were not randomly sampled. You can choose to have these written to a shapefile. It is generally a good backup policy to write these as well.
sampler_ouput$samp_remaining %>% head() %>% knitr::kable()
row | lon | lat | strata | uuid | rnd_seed |
---|---|---|---|---|---|
3 | 90.19238 | 26.24277 | E | 3 | 828005 |
4 | 90.02332 | 25.27046 | H | 4 | 828005 |
7 | 90.77956 | 25.45381 | E | 7 | 828005 |
8 | 90.88944 | 22.56836 | G | 8 | 828005 |
9 | 90.76433 | 21.99042 | A | 9 | 828005 |
11 | 90.83148 | 25.57179 | E | 11 | 828005 |
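Writing the remaining points out by hand might look like the sketch below. It rebuilds the first two rows of the table above as a mock data frame and writes them with `sf::st_write()`; the temporary output path is only for illustration, and this is separate from whatever option the sampler function itself offers.

```r
library(sf)

# Mock of the remaining-points table (first two rows shown above)
remaining <- data.frame(
  lon = c(90.19238, 90.02332),
  lat = c(26.24277, 25.27046),
  strata = c("E", "H"),
  uuid = c(3, 4),
  rnd_seed = 828005
)

# Convert to an sf point object in WGS84 and write it as a shapefile
remaining_sf <- sf::st_as_sf(remaining, coords = c("lon", "lat"), crs = 4326)
out_path <- tempfile(fileext = ".shp")  # illustrative path; use a real folder in practice
sf::st_write(remaining_sf, out_path, quiet = TRUE)

# Read it back to confirm the backup is usable
backup <- sf::st_read(out_path, quiet = TRUE)
```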
For the next example, I will first generate two fake point data sets and convert them to sf objects. The sf package is great!
library(sf)
set.seed(799)
lon1<-runif(min=88.00863,max=92.68031, n=1000)
lat1<-runif(min=20.59061,max=26.63451, n=1000)
lon2<-runif(min=88.00863,max=92.68031, n=1000)
lat2<-runif(min=20.59061,max=26.63451, n=1000)
strata_options<-LETTERS[1:8]
#make a simulated dataset
pt_data1<-data.frame(lon=lon1, lat=lat1, strata=sample(strata_options,1000, replace=TRUE))
pt_data2<-data.frame(lon=lon2, lat=lat2, strata=sample(strata_options,1000, replace=TRUE))
# convert to simple feature object
coords<- c("lon", "lat")
pt_sf1<- sf::st_as_sf(x = pt_data1, coords=coords, crs=4326)
pt_sf2<- sf::st_as_sf(x = pt_data2, coords=coords, crs=4326)
Next I will show two spatial verification functions. The first one just finds the closest distance between points. It uses R-tree spatial indexing (via the rtree package), so it works quickly on fairly large datasets.
closest_pts<- butteR::closest_distance_rtree(pt_sf1, pt_sf2)
#> Warning in rtree::knn.RTree(rTree = sf2_tree, st_coordinates(sf1)[,
#> c("X", : k was cast to integer, this may lead to unexpected results.
closest_pts %>% head() %>% knitr::kable()
row | strata | geometry | strata.1 | geometry.1 | dist_m |
---|---|---|---|---|---|
755 | C | c(88.5246591396806, 26.0766159565661) | H | c(88.542828683707, 25.8766529368377) | 22228.020 |
798 | C | c(91.3460825806255, 22.3494960887145) | F | c(91.3754625593381, 22.3643193468922) | 3442.702 |
464 | C | c(91.6884048353551, 26.0950136747809) | B | c(91.6959527733822, 26.0490176807472) | 5151.514 |
902 | B | c(88.782772209299, 22.2289078448025) | C | c(88.812609722456, 22.2312796777867) | 3087.283 |
199 | B | c(91.9385484030803, 22.9929798167442) | A | c(92.0439420932042, 22.9314622797974) | 12776.161 |
419 | D | c(88.6396377435045, 22.2862520419468) | C | c(88.7253538271838, 22.3836231110146) | 13936.767 |
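To make the idea of a closest distance concrete, here is a brute-force sketch using only sf. It is not how `closest_distance_rtree` is implemented: it computes every pairwise distance, which is fine for a handful of points but grows as the product of the two dataset sizes, and that cost is what a spatial index avoids. The point sets `a` and `b` are made up, and a recent sf version is assumed for distances on lon/lat data.

```r
library(sf)

set.seed(799)
# Two small made-up point sets in WGS84
a <- sf::st_as_sf(data.frame(lon = runif(5, 88, 92), lat = runif(5, 21, 26)),
                  coords = c("lon", "lat"), crs = 4326)
b <- sf::st_as_sf(data.frame(lon = runif(5, 88, 92), lat = runif(5, 21, 26)),
                  coords = c("lon", "lat"), crs = 4326)

# Full pairwise distance matrix in metres (rows = a, columns = b), units dropped
d <- sf::st_distance(a, b)
d_m <- matrix(as.numeric(d), nrow = nrow(a))

# For each point in a: index of the closest point in b and the distance to it
nearest_idx <- apply(d_m, 1, which.min)
nearest_dist_m <- d_m[cbind(seq_len(nrow(a)), nearest_idx)]
```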
You could easily just filter the “closest_pts” output by a distance threshold of your choice. However, to make it simpler, I have wrapped this function in the function “check_distances_from_target” (I need to come up with a better name for this function). It will return all of the points in “dataset” that are further than the set threshold from any point in “target_points”. It will also show you the distance to the closest target point. Obviously this is fake data, so there are a ton of points returned (I will just display the first 6 rows). In your assessment data there should be far fewer.
set.seed(799)
pts_further_than_50m_threshold_from_target<-
butteR::check_distances_from_target(dataset = pt_sf1,target_points =pt_sf2,dataset_coordinates = coords,
cols_to_report = "strata", distance_threshold = 50)
#> Warning in rtree::knn.RTree(rTree = sf2_tree, st_coordinates(sf1)[,
#> c("X", : k was cast to integer, this may lead to unexpected results.
pts_further_than_50m_threshold_from_target %>% head() %>% knitr::kable()
strata | dist_m |
---|---|
C | 22228.020 |
C | 3442.702 |
C | 5151.514 |
B | 3087.283 |
B | 12776.161 |
D | 13936.767 |
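For comparison, the manual filter on the closest-distance output could be sketched as follows. The data frame is a mock built from the `strata` and `dist_m` values shown above, and the 5000 m threshold is a hypothetical value chosen only so that the filter visibly drops some rows.

```r
# Mock of the closest-distance output, using the values displayed above
closest <- data.frame(
  strata = c("C", "C", "C", "B", "B", "D"),
  dist_m = c(22228.020, 3442.702, 5151.514, 3087.283, 12776.161, 13936.767)
)

# Keep only the points that are further than the threshold from their closest target
threshold_m <- 5000  # hypothetical threshold in metres
too_far <- dplyr::filter(closest, dist_m > threshold_m)
```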