Statistics library for non-statistical people
If you're into statistics then PHP will not be your language of choice (try R instead), but if for any reason you, a non-statistician, need to do some stats, this library aims to provide a simple set of methods for common statistical functions.
By design, with the exception of statistical tests, functions generally accept a single series of data at a time. This is to keep the library simple to use.
Many of the methods in this library are available from the Statistics Extension; however, that extension is not included in PHP by default. If possible, I'd recommend using the extension rather than my stats library.
- Install with Composer:
composer require richjenks/stats
- Include autoloader:
require 'vendor/autoload.php';
- All static methods are available from the RichJenks\Stats\Stats class
<?php
require 'vendor/autoload.php';
use RichJenks\Stats\Stats;
echo Stats::mean([1, 2, 3]);
// 2
Stats will generally return either a float or an array, whichever is most appropriate for the function.
Calculates the mean/average of given data:
Stats::mean([1, 2, 3]);
// 2
Stats::mean([15, 1000, 68.5, 9]);
// 273.125
The average function aliases mean, e.g. Stats::average([1, 2, 3]); also returns 2
Calculates the median (middle value) of given data:
Stats::median([1, 2, 3, 4]);
// 2.5
Stats::median([3.141, 1.618, 1.234]);
// 1.618
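The two examples differ in how the middle is found: an even count averages the two central values, while an odd count takes the single central value. A minimal sketch of that rule, using a hypothetical sketch_median helper that is not part of the library and assumes a non-empty series:

```php
<?php
// Hypothetical helper for illustration; not part of RichJenks\Stats.
// Assumes $data is non-empty and numeric.
function sketch_median(array $data): float
{
    sort($data);            // order the values
    $n   = count($data);
    $mid = intdiv($n, 2);   // index of the upper-middle value

    // Odd count: the single middle value.
    // Even count: the mean of the two middle values.
    return $n % 2 === 1
        ? (float) $data[$mid]
        : ($data[$mid - 1] + $data[$mid]) / 2;
}

echo sketch_median([1, 2, 3, 4]), "\n";          // 2.5
echo sketch_median([3.141, 1.618, 1.234]), "\n"; // 1.618
```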
Calculates the mode(s), i.e. the most common value(s), of given data:
Stats::mode([1, 2, 2, 3]);
// [2]
Stats::mode([1, 2, 2, 3, 3]);
// [2, 3]
This function always returns an array because it is able to handle multi-modal data; an empty array would mean there is no mode.
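One way to get this array-returning behaviour is to count occurrences and keep every value that ties for the highest count. The sketch below is not the library's code: sketch_mode is a hypothetical name, it assumes integer or string values (a limit of PHP's array_count_values), and treating "every value occurs once" as "no mode" is an assumption about when the empty array appears.

```php
<?php
// Hypothetical helper for illustration; not the library's implementation.
// Assumes integer or string values, since array_count_values() only counts those.
function sketch_mode(array $data): array
{
    if ($data === []) {
        return [];
    }

    $counts = array_count_values($data); // value => number of occurrences
    $max    = max($counts);

    // Assumption: if several distinct values each occur once, there is no mode.
    if ($max === 1 && count($counts) > 1) {
        return [];
    }

    // Keep every value that reaches the highest count (multi-modal data).
    return array_keys($counts, $max, true);
}

echo json_encode(sketch_mode([1, 2, 2, 3])), "\n";    // [2]
echo json_encode(sketch_mode([1, 2, 2, 3, 3])), "\n"; // [2,3]
```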
Constructs an array of frequencies for each value in a series, sorted so that the most frequent values come first:
Stats::frequencies([1, 2, 3]);
// [
// 1 => 1,
// 2 => 1,
// 3 => 1,
// ]
Stats::frequencies([10, 20, 20]);
// [
// 20 => 2,
// 10 => 1,
// ]
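The same shape of result can be reproduced with two built-in PHP functions. This is only a sketch under assumptions: sketch_frequencies is a hypothetical name, the values must be integers or strings, and the order of tied values is only guaranteed to be preserved on PHP 8.0 or later, where sorting is stable.

```php
<?php
// Hypothetical helper for illustration; not the library's implementation.
// Assumes integer or string values (array_count_values() limitation).
function sketch_frequencies(array $data): array
{
    $counts = array_count_values($data); // value => occurrences
    arsort($counts);                     // highest count first, keys preserved
    return $counts;
}

print_r(sketch_frequencies([10, 20, 20])); // 20 => 2, then 10 => 1
```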
Determines the range (highest minus lowest) of given data:
Stats::range([1, 9]);
// 8
Stats::range([-41, 1.61803]);
// 42.61803
These functions calculate:
- Variance: average of the squared deviations from the mean
- Standard Deviation: typical variation from the mean (square root of Variance)
$data = [1, 2, 3, 4, 5];
Stats::variance($data);
// 2.5
Stats::sd($data);
// 1.5811388301
The deviations function is also available if you require the squared deviation from the mean for each individual value, for example:
Stats::deviations([1, 2, 3, 4, 5]);
// [
// 1 => 4,
// 2 => 1,
// 3 => 0,
// 4 => 1,
// 5 => 4,
// ]
Stats::deviations([42, 75, 101, 22.5, 18]);
// [
// 42 => 94.09,
// 75 => 542.89,
// 101 => 2430.49,
// 22.5 => 852.64,
// 18 => 1135.69,
// ]
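The figures in both examples are squared distances from the mean (for the second series the mean is 51.7, so 42 gives (42 - 51.7)² = 94.09). A minimal sketch of that calculation follows; sketch_squared_deviations is a hypothetical name, it needs PHP 7.4+ for the arrow function, and it returns a plain list in input order rather than an array keyed by value, because PHP truncates fractional float keys such as 22.5.

```php
<?php
// Hypothetical helper for illustration; not the library's implementation.
// Returns the squared deviation from the mean for each value, in input order.
function sketch_squared_deviations(array $data): array
{
    $mean = array_sum($data) / count($data);

    return array_map(fn($x) => ($x - $mean) ** 2, $data);
}

print_r(sketch_squared_deviations([1, 2, 3, 4, 5])); // 4, 1, 0, 1, 4
```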
Sample is the default mode for Variance and Standard Deviation, but if you're unsure of the effect this decision has on your data then you probably don't need it and can skip this section.
Definitions
- Population: every applicable subject, e.g. people who wear glasses or non-extinct species of frog
- Sample: the subset of subjects for which data is available, e.g. 100 glasses-wearing subjects or a dozen species of frog
You can optionally pass the constant Stats::SAMPLE or Stats::POPULATION as the second parameter to determine whether your data is for a sample or a whole population:
$data = [1, 2, 3, 4, 5];
Stats::variance($data, Stats::POPULATION);
// 2
Stats::sd($data, Stats::POPULATION);
// 1.4142135624
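The only difference between the two modes is the divisor: sample data divides the sum of squared deviations by n - 1 (Bessel's correction), population data divides by n. The sketch below reproduces the four results shown in this section; the function and constant names are hypothetical and this is not the library's own code.

```php
<?php
// Hypothetical helpers and constants for illustration only.
const SKETCH_SAMPLE     = 'sample';
const SKETCH_POPULATION = 'population';

function sketch_variance(array $data, string $type = SKETCH_SAMPLE): float
{
    $n    = count($data);
    $mean = array_sum($data) / $n;

    // Sum of squared deviations from the mean.
    $sum = 0.0;
    foreach ($data as $x) {
        $sum += ($x - $mean) ** 2;
    }

    // Sample: divide by n - 1. Population: divide by n.
    return $sum / ($type === SKETCH_SAMPLE ? $n - 1 : $n);
}

function sketch_sd(array $data, string $type = SKETCH_SAMPLE): float
{
    return sqrt(sketch_variance($data, $type));
}

$data = [1, 2, 3, 4, 5];
echo sketch_variance($data), "\n";                         // 2.5
echo sketch_variance($data, SKETCH_POPULATION), "\n";      // 2
echo round(sketch_sd($data), 10), "\n";                    // 1.5811388301
echo round(sketch_sd($data, SKETCH_POPULATION), 10), "\n"; // 1.4142135624
```

For large series the two divisors give nearly the same answer; the choice matters most when n is small.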
Estimates how well the sample mean approximates the population mean:
Stats::sem([1, 2, 3, 4, 5]);
// 0.70710678118655
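The result above matches the usual definition of the standard error of the mean: the sample standard deviation divided by the square root of the number of values. A short sketch under that assumption, with a hypothetical sketch_sem helper:

```php
<?php
// Hypothetical helper for illustration; assumes at least two values.
function sketch_sem(array $data): float
{
    $n    = count($data);
    $mean = array_sum($data) / $n;

    $sum = 0.0;
    foreach ($data as $x) {
        $sum += ($x - $mean) ** 2;
    }

    $sd = sqrt($sum / ($n - 1)); // sample standard deviation

    return $sd / sqrt($n);       // standard error of the mean
}

echo sketch_sem([1, 2, 3, 4, 5]), "\n"; // approximately 0.7071
```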
These functions calculate the data required to construct a Box Plot which, when you understand what each data point means, is a concise way of displaying and comparing data sets.
Calculates Quartiles 0 to 4, where:
- 0 is the lowest data point
- 1 is Q¹
- 2 is Q² (the median)
- 3 is Q³
- 4 is the highest data point
Stats::quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
// [
// 0 => 1,
// 1 => 3.5,
// 2 => 6.5,
// 3 => 9.5,
// 4 => 12,
// ]
Stats::quartiles([839, 560, 607, 828, 875, 805, 646, 450, 930, 443]);
// [
// 0 => 443,
// 1 => 560,
// 2 => 725.5,
// 3 => 839,
// 4 => 930,
// ]
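Both outputs are consistent with the "median of each half" method: sort the series, take the median of the lower half as Q¹ and the median of the upper half as Q³, leaving out the middle value when the count is odd. That method is inferred from the example numbers rather than from the library's source, and the helper names below are hypothetical.

```php
<?php
// Hypothetical helpers for illustration; not the library's implementation.
function sketch_mid(array $sorted): float
{
    $n   = count($sorted);
    $mid = intdiv($n, 2);

    return $n % 2 === 1
        ? (float) $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

// Assumes at least two values.
function sketch_quartiles(array $data): array
{
    sort($data);
    $n    = count($data);
    $half = intdiv($n, 2);

    $lower = array_slice($data, 0, $half);   // values below the middle
    $upper = array_slice($data, $n - $half); // values above the middle

    return [
        0 => $data[0],
        1 => sketch_mid($lower),
        2 => sketch_mid($data),
        3 => sketch_mid($upper),
        4 => $data[$n - 1],
    ];
}

print_r(sketch_quartiles([839, 560, 607, 828, 875, 805, 646, 450, 930, 443]));
// 443, 560, 725.5, 839, 930
```

Other quartile conventions exist (for example, interpolating between ranks), so results from other tools may differ slightly.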
Calculates the range between Q¹ and Q³ (the middle 50% of data):
Stats::iqr([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
// 6
Stats::iqr([839, 560, 607, 828, 875, 805, 646, 450, 930, 443]);
// 279
Determines which values in a series are outliers (values far enough from the rest that they are sometimes omitted from the data set, e.g. because of possible experimental error):
Stats::outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
// []
Stats::outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 999]);
// [999]
Determines which values in a series are not outliers, i.e. removes outliers:
Stats::inliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 999]);
// [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Determines the lower and upper limit for identifying outliers:
Stats::whiskers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 999]);
// ['lower' => -6, 'upper' => 18]
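The limits in this example match the common Tukey rule: lower = Q¹ - 1.5 × IQR and upper = Q³ + 1.5 × IQR. Here Q¹ = 3 and Q³ = 9, so IQR = 6 and the limits are -6 and 18. The sketch below assumes that rule, which is inferred from the example rather than confirmed from the library's source; all helper names are hypothetical and PHP 7.4+ is needed for the arrow function.

```php
<?php
// Hypothetical helpers for illustration; not the library's implementation.
function sketch_middle(array $sorted): float
{
    $n   = count($sorted);
    $mid = intdiv($n, 2);

    return $n % 2 === 1
        ? (float) $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

// Assumed rule: limits sit 1.5 x IQR outside Q1 and Q3.
function sketch_whiskers(array $data): array
{
    sort($data);
    $n    = count($data);
    $half = intdiv($n, 2);

    $q1  = sketch_middle(array_slice($data, 0, $half));
    $q3  = sketch_middle(array_slice($data, $n - $half));
    $iqr = $q3 - $q1;

    return ['lower' => $q1 - 1.5 * $iqr, 'upper' => $q3 + 1.5 * $iqr];
}

// Values outside the limits are outliers; everything else is an inlier.
function sketch_outliers(array $data): array
{
    $w = sketch_whiskers($data);

    return array_values(array_filter(
        $data,
        fn($x) => $x < $w['lower'] || $x > $w['upper']
    ));
}

print_r(sketch_whiskers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 999])); // lower -6, upper 18
print_r(sketch_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 999])); // 999
```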
All percentile functions accept an optional additional parameter for rounding that works as follows:
- If omitted, percentages are rounded to the nearest whole number
- If a positive integer, percentages are rounded to that many decimal places
- If a negative integer (e.g. -1), percentages are not rounded
Determines the percentile of each value:
// Closest Rank
Stats::percentiles([15, 20, 35, 40, 50]);
// [
// 15 => 0,
// 20 => 14,
// 35 => 57,
// 40 => 71,
// 50 => 100,
// ]
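The figures above are consistent with placing each value on a linear scale from the minimum (0%) to the maximum (100%), e.g. (20 - 15) / (50 - 15) × 100 ≈ 14. That reading is inferred from the numbers, not taken from the library's source. The sketch below also shows the rounding parameter as described; sketch_percentiles and its $precision argument are hypothetical names, and integer values are assumed so they can be used as array keys.

```php
<?php
// Hypothetical helper for illustration; not the library's implementation.
// Assumes integer values (used as array keys) and a non-empty series.
function sketch_percentiles(array $data, int $precision = 0): array
{
    $min  = min($data);
    $span = max($data) - $min;

    $out = [];
    foreach ($data as $x) {
        // Position of $x between the minimum (0) and the maximum (100).
        $p = $span == 0 ? 0.0 : ($x - $min) / $span * 100;

        // Negative precision: leave unrounded. Otherwise round to that many places.
        $out[$x] = $precision < 0 ? $p : round($p, $precision);
    }

    return $out;
}

print_r(sketch_percentiles([15, 20, 35, 40, 50]));    // 0, 14, 57, 71, 100
print_r(sketch_percentiles([15, 20, 35, 40, 50], 1)); // 0, 14.3, 57.1, 71.4, 100
```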
Determines the value closest to the given percentile:
Stats::percentile([15, 20, 35, 40, 50], 75);
// [
// 'value' => 40,
// 'percentile' => 71,
// ]
Determines the values that fall in the given percentile, i.e. the lowest x% of all values:
Stats::intrapercentile([15, 20, 35, 40, 50], 60);
// [
// 15 => 0,
// 20 => 14,
// 35 => 57,
// ]
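Both lookups can be read as simple operations on the value-to-percentile map: percentile picks the entry whose percentile is nearest the target (71 is closer to 75 than 100 is), and intrapercentile keeps the entries at or below the limit. The sketch below assumes the same min-max scaling that the example numbers suggest; every function name in it is hypothetical, and it requires a non-empty series of integers and PHP 7.4+.

```php
<?php
// Hypothetical helpers for illustration; not the library's implementation.

// value => percentile, rounded to the nearest whole number.
function sketch_positions(array $data): array
{
    $min  = min($data);
    $span = max($data) - $min;

    $out = [];
    foreach ($data as $x) {
        $out[$x] = $span == 0 ? 0.0 : round(($x - $min) / $span * 100);
    }

    return $out;
}

// The value whose percentile is nearest to $target.
function sketch_closest(array $data, float $target): array
{
    $best = null;
    foreach (sketch_positions($data) as $value => $p) {
        if ($best === null || abs($p - $target) < abs($best['percentile'] - $target)) {
            $best = ['value' => $value, 'percentile' => $p];
        }
    }

    return $best;
}

// The values whose percentile is at or below $limit.
function sketch_within(array $data, float $limit): array
{
    return array_filter(sketch_positions($data), fn($p) => $p <= $limit);
}

print_r(sketch_closest([15, 20, 35, 40, 50], 75)); // value 40, percentile 71
print_r(sketch_within([15, 20, 35, 40, 50], 60));  // 15 => 0, 20 => 14, 35 => 57
```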
CLI usage is supported via the included scli (Stats Command Line Interface) file, which simply expects the name of the required method followed by its arguments:
./scli mean 1 2 3
# 2
./scli inliers 1 2 3 4 5 999
# 1,2,3,4,5
In cases where the result is a set (i.e. an array), it is presented as comma-separated values.
phpunit --bootstrap Stats.php tests/StatsTest