The goal of distops is to provide a set of functions to compute distances between observations in a sample and to perform operations on distance matrices.
You can install the development version of distops from GitHub with:
# install.packages("devtools")
devtools::install_github("LMJL-Alea/distops")
library(distops)
We provide two functions to help package developers define efficient implementations of dist functions for custom distances, namely:
- use_distops() sets up a package to use distops for computing distances. In particular, it creates a src/ directory with a Makevars file and a Makevars.win file. It also creates an R/distops-package.R file with the appropriate roxygen2 tags so that the NAMESPACE file is modified to add the importFrom() directives for the Rcpp and RcppParallel packages and the useDynLib() directive for packages with compiled code. Finally, it modifies the DESCRIPTION file to add Rcpp, RcppParallel and distops to the Imports and LinkingTo fields and GNU make to the SystemRequirements field.
- use_distance() creates R and C++ files for easy implementation of custom distances.
Let us compute the Euclidean distance matrix for the iris dataset:
D <- dist(iris[, 1:4], method = "euclidean")
We can subset this matrix using the [ operator. We can either provide the same indices for rows and columns, in which case it returns another object of class dist:
D[1:3, 1:3]
#>           1         2
#> 2 0.5385165
#> 3 0.5099020 0.3000000
Or we can provide different indices for rows and columns in which case it returns a dense matrix:
D[2:3, 7:12]
#>           7         8         9        10        11        12
#> 2 0.5099020 0.4242641 0.5099020 0.1732051 0.8660254 0.4582576
#> 3 0.2645751 0.4123106 0.4358899 0.3162278 0.8831761 0.3741657
The subsetting operation is fully parallelized using the RcppParallel package. It is also memory efficient as it does not copy the original distance matrix.
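To illustrate why a dense block can be read without first expanding the whole matrix, here is a minimal base R sketch. It relies on the documented layout of a dist object (the lower triangle stored by columns in a single vector) and is not the distops implementation: the helper name subset_dist_dense() is hypothetical, it runs sequentially, and as.vector() makes a copy that the package is stated to avoid.

```r
# Hypothetical helper (not part of distops): read the block d[rows, cols]
# directly from the compact lower-triangle vector of a base R "dist" object.
subset_dist_dense <- function(d, rows, cols) {
  n <- attr(d, "Size")
  v <- as.vector(d)  # plain numeric vector; this copies, unlike the package
  out <- matrix(0, nrow = length(rows), ncol = length(cols),
                dimnames = list(rows, cols))
  for (a in seq_along(rows)) {
    for (b in seq_along(cols)) {
      i <- min(rows[a], cols[b])
      j <- max(rows[a], cols[b])
      if (i != j) {
        # Position of the pair (i, j), with i < j, in the compact vector
        k <- n * (i - 1) - i * (i - 1) / 2 + j - i
        out[a, b] <- v[k]
      }
      # i == j is the diagonal, which stays at 0
    }
  }
  out
}

D <- dist(iris[, 1:4], method = "euclidean")
block <- subset_dist_dense(D, 2:3, 7:12)
```

The result can be compared against as.matrix(D)[2:3, 7:12], which builds the full 150 x 150 matrix first and is therefore only practical for small samples.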
The medoid of a sample is the observation that minimizes the sum of distances to all other observations. The find_medoids() function computes the medoid of a sample for a given distance. It takes advantage of the RcppParallel package to compute the medoid in parallel.
find_medoids(D)
#> [1] 62
If the memberships argument is provided, it returns the medoid for each cluster.
find_medoids(D, memberships = as.factor(rep(1:3, each = 50L)))
#> 1 2 3
#> 8 97 113
- Pass a list instead of a matrix to be more general?
- Use Arrow parquet format to store distance matrix in multiple files when sample size exceeds 10,000 or something like that.
- Use Arrow connection to read in large data.
- Add a progress bar.