Extract summary statistics of R package structure and functionality. Not
all statistics, of course, but a good attempt at balancing insightful
statistics against computational feasibility. pkgstats is a static code
analysis tool, so is generally very fast (a few seconds at most for very
large packages). Installation is described in a separate vignette.
Statistics are derived from these primary sources:
- Numbers of lines of code, documentation, and white space (both between and within lines) in each directory and language
- Summaries of package DESCRIPTION file and related package meta-statistics
- Summaries of all objects created via package code across multiple languages and all directories containing source code (./R, ./src, and ./inst/include).
- A function call network derived from function definitions obtained from the code tagging library, ctags, and references (“calls”) to those obtained from another tagging library, gtags. This network roughly connects every object making a call (as from) with every object being called (to).
- An additional function call network connecting calls within R functions to all functions from other R packages.
The primary function, pkgstats(), returns a list of these various
components, including full data.frame objects for the final three
components described above. The statistical properties of this list can
be aggregated by the pkgstats_summary() function, which returns a
data.frame with a single row of summary statistics. This function is
demonstrated below, including full details of all statistics extracted.
The following code demonstrates the output of the main function,
pkgstats(), using an internally bundled .tar.gz “tarball” of this
package. The system.time call demonstrates that the static code
analyses of pkgstats are generally very fast.
library (pkgstats)
tarball <- system.file ("extdata", "pkgstats_9.9.tar.gz", package = "pkgstats")
system.time (
p <- pkgstats (tarball)
)
## user system elapsed
## 1.701 0.124 1.802
names (p)
## [1] "loc" "vignettes" "data_stats" "desc"
## [5] "translations" "objects" "network" "external_calls"
The result is a list of various data extracted from the code. All except
for objects, network, and external_calls represent summary data:
p [!names (p) %in% c ("objects", "network", "external_calls")]
## $loc
## # A tibble: 3 × 12
## # Groups: language, dir [3]
## language dir nfiles nlines ncode ndoc nempty nspaces nchars nexpr ntabs
## <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1 C++ src 3 365 277 21 67 933 7002 1 0
## 2 R R 19 3741 2698 536 507 27575 94022 1 0
## 3 R tests 7 348 266 10 72 770 6161 1 0
## # … with 1 more variable: indentation <int>
##
## $vignettes
## vignettes demos
## 0 0
##
## $data_stats
## n total_size median_size
## 0 0 0
##
## $desc
## package version date license
## 1 pkgstats 9.9 2022-05-12 11:41:22 GPL-3
## urls
## 1 https://docs.ropensci.org/pkgstats/,\nhttps://github.com/ropensci-review-tools/pkgstats
## bugs aut ctb fnd rev ths
## 1 https://github.com/ropensci-review-tools/pkgstats/issues 1 0 0 0 0
## trl depends imports
## 1 0 NA brio, checkmate, dplyr, fs, igraph, methods, readr, sys, withr
## suggests
## 1 hms, knitr, pbapply, pkgbuild, Rcpp, rmarkdown, roxygen2, testthat, visNetwork
## enhances linking_to
## 1 NA cpp11
##
## $translations
## [1] NA
The various components of these results are described in further detail in the main package vignette.
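As a small illustration of how the loc component can be used, the following sketch derives the proportions of documentation and empty lines per directory. It rebuilds the relevant columns of the table printed above as a plain data.frame so that it runs without the package; with a real result, p$loc could be used in place of the mock object. The column names prop_doc and prop_empty are introduced here for illustration only.

```r
# Mock of selected columns of the $loc table printed above.
# With the package installed, `loc <- p$loc` should provide the same columns.
loc <- data.frame (
    language = c ("C++", "R", "R"),
    dir = c ("src", "R", "tests"),
    nlines = c (365L, 3741L, 348L),
    ncode = c (277L, 2698L, 266L),
    ndoc = c (21L, 536L, 10L),
    nempty = c (67L, 507L, 72L),
    stringsAsFactors = FALSE
)

# In the values printed above, code, documentation, and empty lines
# add up to the total number of lines in each directory.
stopifnot (all (loc$ncode + loc$ndoc + loc$nempty == loc$nlines))

# Illustrative derived statistics (names are not part of pkgstats)
loc$prop_doc <- round (loc$ndoc / loc$nlines, 3)
loc$prop_empty <- round (loc$nempty / loc$nlines, 3)

print (loc [, c ("language", "dir", "prop_doc", "prop_empty")])
```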
A summary of the pkgstats data can be obtained by submitting the object
returned from pkgstats() to the pkgstats_summary() function:
s <- pkgstats_summary (p)
This function reduces the result of the pkgstats() function to a single
line with 95 entries, represented as a data.frame with one row and that
number of columns. This format is intended to enable summary statistics
from multiple packages to be aggregated by simply binding rows together.
While 95 statistics might seem like a lot, the pkgstats_summary()
function aims to return as many usable raw statistics as possible in
order to flexibly allow higher-level statistics to be derived through
combination and aggregation. These 95 statistics can be roughly grouped
into the following categories (not shown in the order in which they
actually appear), with variable names in parentheses after each
description. Some statistics are summarised as comma-delimited character
strings, such as translations into human languages, or other packages
listed under “depends”, “imports”, or “suggests”. This enables
subsequent analyses of their contents, for example of actual translated
languages, or both aggregate numbers and individual details of all
package dependencies, as demonstrated below.
Package Summaries
- Name (package)
- Package version (version)
- Package date, as modification time of DESCRIPTION file where not explicitly stated (date)
- License (license)
- Languages, as a single comma-separated character value (languages), and excluding R itself.
- List of translations where package includes translations files, given as list of (spoken) language codes (translations).
Information from DESCRIPTION file
- Package URL(s) (url)
- URL for BugReports (bugs)
- Number of contributors with role of author (desc_n_aut), contributor (desc_n_ctb), funder (desc_n_fnd), reviewer (desc_n_rev), thesis advisor (ths), and translator (trl, relating to translation between computer languages rather than spoken languages).
- Comma-separated character entries for all depends, imports, suggests, and linking_to packages.
Numbers of entries in each of the last two kinds of items can be
obtained with a simple strsplit call, like this:
deps <- strsplit (s$suggests, ", ") [[1]]
length (deps)
## [1] 9
print (deps)
## [1] "hms" "knitr" "pbapply" "pkgbuild" "Rcpp"
## [6] "rmarkdown" "roxygen2" "testthat" "visNetwork"
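The same approach can be extended to all dependency fields at once. The sketch below is one possible way to do so, not part of the package: it uses a mock one-row data.frame holding the dependency strings printed in the $desc output above, and a hypothetical helper, count_entries(), which treats NA as zero entries. That step matters because strsplit() applied to NA returns a single NA, so length() alone would report one entry for a package with no "depends".

```r
# Mock of the dependency columns, copied from the $desc output above.
# A real object would come from pkgstats_summary ().
s_mock <- data.frame (
    depends = NA_character_,
    imports = "brio, checkmate, dplyr, fs, igraph, methods, readr, sys, withr",
    suggests = paste0 (
        "hms, knitr, pbapply, pkgbuild, Rcpp, ",
        "rmarkdown, roxygen2, testthat, visNetwork"
    ),
    linking_to = "cpp11",
    stringsAsFactors = FALSE
)

# Hypothetical helper: count comma-delimited entries, with NA counted as zero
count_entries <- function (x) {
    if (is.na (x)) {
        return (0L)
    }
    length (strsplit (x, ",\\s*") [[1]])
}

# One count per dependency field, returned as a named integer vector
dep_counts <- vapply (s_mock, count_entries, integer (1))
print (dep_counts)
```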
Numbers of files and associated data
- Number of vignettes (num_vignettes)
- Number of demos (num_demos)
- Number of data files (num_data_files)
- Total size of all package data (data_size_total)
- Median size of package data files (data_size_median)
- Numbers of files in main sub-directories (files_R, files_src, files_inst, files_vignettes, files_tests), where numbers are recursively counted in all sub-directories, and where inst only counts files in the inst/include sub-directory.
Statistics on lines of code
- Total lines of code in each sub-directory (loc_R, loc_src, loc_inst, loc_vignettes, loc_tests).
- Total numbers of blank lines in each sub-directory (blank_lines_R, blank_lines_src, blank_lines_inst, blank_lines_vignettes, blank_lines_tests).
- Total numbers of comment lines in each sub-directory (comment_lines_R, comment_lines_src, comment_lines_inst, comment_lines_vignettes, comment_lines_tests).
- Measures of relative white space in each sub-directory (rel_space_R, rel_space_src, rel_space_inst, rel_space_vignettes, rel_space_tests), as well as an overall measure for the R/, src/, and inst/ directories (rel_space).
- The number of spaces used to indent code (indentation), with values of -1 indicating indentation with tab characters.
- The median number of nested expressions per line of code, counting only those lines which have any expressions (nexpr).
Statistics on individual objects (including functions)
These statistics all refer to “functions”, but actually represent more general “objects,” such as global variables or class definitions (generally from languages other than R), as detailed below.
- Numbers of functions in R (n_fns_r)
- Numbers of exported and non-exported R functions (n_fns_r_exported, n_fns_r_not_exported)
- Number of functions (or objects) in other computer languages (n_fns_src), including functions in both src and inst/include directories.
- Number of functions (or objects) per individual file in R and in all other (src) directories (n_fns_per_file_r, n_fns_per_file_src).
- Mean and median numbers of parameters per exported R function (npars_exported_mn, npars_exported_md).
- Mean and median lines of code per function in R and other languages, including distinction between exported and non-exported R functions (loc_per_fn_r_mn, loc_per_fn_r_md, loc_per_fn_r_exp_mn, loc_per_fn_r_exp_md, loc_per_fn_r_not_exp_mn, loc_per_fn_r_not_exp_md, loc_per_fn_src_mn, loc_per_fn_src_md).
- Equivalent mean and median numbers of documentation lines per function (doclines_per_fn_exp_mn, doclines_per_fn_exp_md, doclines_per_fn_not_exp_mn, doclines_per_fn_not_exp_md, docchars_per_par_exp_mn, docchars_per_par_exp_md).
Network Statistics
The full structure of the network table is described below, with
summary statistics including:
- Number of edges, including distinction between languages (n_edges, n_edges_r, n_edges_src).
- Number of distinct clusters in package network (n_clusters).
- Mean and median centrality of all network edges, calculated from both directed and undirected representations of network (centrality_dir_mn, centrality_dir_md, centrality_undir_mn, centrality_undir_md).
- Equivalent centrality values excluding edges with centrality of zero (centrality_dir_mn_no0, centrality_dir_md_no0, centrality_undir_mn_no0, centrality_undir_md_no0).
- Numbers of terminal edges (num_terminal_edges_dir, num_terminal_edges_undir).
- Summary statistics on node degree (node_degree_mn, node_degree_md, node_degree_max)
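To make the node degree statistics more concrete, the following base R sketch computes the mean, median, and maximum degree from a small edge table with from and to columns. The edge table is entirely hypothetical, and degree is counted here as the number of edge endpoints touching each node (incoming plus outgoing); this is an assumption for illustration, and the package's own calculation may differ.

```r
# Hypothetical edge table: each row is one call from one function to another
edges <- data.frame (
    from = c ("a", "a", "b", "c"),
    to = c ("b", "c", "c", "d"),
    stringsAsFactors = FALSE
)

# Degree of each node, counted as incoming plus outgoing edges
degree <- table (c (edges$from, edges$to))

# Illustrative equivalents of the edge count and node degree summaries
network_summary <- c (
    n_edges = nrow (edges),
    node_degree_mn = mean (degree),
    node_degree_md = median (degree),
    node_degree_max = max (degree)
)
print (network_summary)
```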
External Call Statistics
The final column in the result of the pkgstats_summary() function
summarises the external_calls object detailing all calls made to
external packages (including to base and recommended packages). This
summary is also represented as a single character string. Each package
lists total numbers of function calls, and total numbers of unique
function calls. Data for each package are separated by a comma, while
data within each package are separated by a colon.
s$external_calls
## [1] "base:447:78,brio:7:1,dplyr:7:4,fs:4:2,graphics:10:2,hms:1:1,igraph:3:3,pbapply:1:1,pkgstats:99:60,readr:8:5,stats:16:2,sys:13:1,tools:2:2,utils:10:7,visNetwork:3:2,withr:5:1"
This structure allows numbers of calls to all packages to be readily extracted with code like the following:
calls <- do.call (
rbind,
strsplit (strsplit (s$external_calls, ",") [[1]], ":")
)
calls <- data.frame (
package = calls [, 1],
n_total = as.integer (calls [, 2]),
n_unique = as.integer (calls [, 3])
)
print (calls)
## package n_total n_unique
## 1 base 447 78
## 2 brio 7 1
## 3 dplyr 7 4
## 4 fs 4 2
## 5 graphics 10 2
## 6 hms 1 1
## 7 igraph 3 3
## 8 pbapply 1 1
## 9 pkgstats 99 60
## 10 readr 8 5
## 11 stats 16 2
## 12 sys 13 1
## 13 tools 2 2
## 14 utils 10 7
## 15 visNetwork 3 2
## 16 withr 5 1
The two numeric columns respectively show the total number of calls made to each package, and the total number of unique functions used within those packages. These results provide detailed information on numbers of calls made to, and functions used from, other R packages, including base and recommended packages.
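One possible use of these data is to compare calls to base R packages, to the package itself, and to other packages. The sketch below starts from the character string printed above so that it runs on its own. The grouping, the hand-picked base_pkgs vector, and the group labels are assumptions made for illustration and are not produced by pkgstats.

```r
# External call string copied from the output of s$external_calls above
ext <- paste0 (
    "base:447:78,brio:7:1,dplyr:7:4,fs:4:2,graphics:10:2,hms:1:1,",
    "igraph:3:3,pbapply:1:1,pkgstats:99:60,readr:8:5,stats:16:2,",
    "sys:13:1,tools:2:2,utils:10:7,visNetwork:3:2,withr:5:1"
)

# Split into one row per package, as in the code above
calls <- do.call (rbind, strsplit (strsplit (ext, ",") [[1]], ":"))
calls <- data.frame (
    package = calls [, 1],
    n_total = as.integer (calls [, 2]),
    n_unique = as.integer (calls [, 3]),
    stringsAsFactors = FALSE
)

# Hand-picked list of base R packages appearing in this result (assumption)
base_pkgs <- c ("base", "graphics", "stats", "tools", "utils")

# Illustrative grouping: the package itself, base R, and everything else
calls$group <- ifelse (
    calls$package == "pkgstats", "self",
    ifelse (calls$package %in% base_pkgs, "base_r", "other")
)

# Total calls per group, and each group's share of all calls
totals <- tapply (calls$n_total, calls$group, sum)
print (totals)
print (round (totals / sum (totals), 2))
```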
Finally, the summary statistics conclude with two further statistics of
afferent_pkg and efferent_pkg. These are package-internal measures of
afferent and efferent couplings between the files of a package. The
afferent couplings (ca) are numbers of incoming calls to each file of a
package from functions defined elsewhere in the package, while the
efferent couplings (ce) are numbers of outgoing calls from each file of
a package to functions defined elsewhere in the package. These can be
used to derive a measure of “internal package instability” as the ratio
of efferent to total coupling (ce / (ce + ca)).
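The instability ratio can be sketched as a small function. The per-file coupling values below are hypothetical, chosen only to illustrate the formula, and the helper name instability() is not part of the package. Files with no couplings at all are returned as NA here, which is one possible convention for the otherwise undefined ratio.

```r
# Hypothetical helper implementing ce / (ce + ca), vectorised over files.
# Files with no incoming or outgoing calls have an undefined ratio,
# returned here as NA.
instability <- function (ca, ce) {
    total <- ca + ce
    ifelse (total == 0, NA_real_, ce / total)
}

# Hypothetical afferent (incoming) and efferent (outgoing) couplings
# for four files of a package
ca <- c (4, 0, 10, 0)
ce <- c (4, 6, 0, 0)

inst <- instability (ca, ce)
print (inst)
```

Values near 1 indicate files that mostly call out to other files, while values near 0 indicate files that are mostly called by others.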
There are many other “raw” statistics returned by the main pkgstats()
function which are not represented in pkgstats_summary(). The main
package vignette provides further detail on the full results.
The following sub-sections provide further detail on the objects,
network, and external_calls items, which could be used to extract
additional statistics beyond those described here.
Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.