The purpose of this repository is to illustrate the many methods to estimate sample quantiles.
Language: Python. License: ISC.
In brief
The purpose of this repository is to illustrate the many methods for estimating sample quantiles. It arose while comparing results from NumPy with those from the Minitab statistical software, which raised the question of why the two differ. This matters to me because the differences in Q1 and Q3 lead to practically significant differences in the estimated confidence intervals of Q2.
I use functions from NumPy and SciPy to estimate quantiles. For methods that neither library provides, I might have to code my own functions.
Explanation of the twelve methods
Quantiles divide the range of a probability distribution into continuous intervals with equal probabilities, or divide the observations in a sample in the same way (Wikipedia). When a sample is drawn from an unknown population, the quantiles must be estimated. Twelve methods commonly appear in statistical packages. Methods 1-3 are based on rounding. Methods 4-9 are based on linear interpolation.
The data are sorted in increasing order. Each method computes $Q_p$, the estimate for the $k^{\text{th}}$ $q$-quantile, where $p = k/q$, from a sample of size $N$ by computing a real-valued index $h$. When $h$ is an integer, the $h^{\text{th}}$ smallest of the $N$ values, $x_h$, is the quantile estimate. Otherwise, a rounding or interpolation scheme is used to compute the quantile estimate from $h$, $x_{\lfloor h \rfloor}$, and $x_{\lceil h \rceil}$.
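The interpolation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular package: the function name `quantile_from_index` is hypothetical, and clamping $h$ to the range $[1, N]$ is an assumption made here so that out-of-range indices fall back to $x_1$ or $x_N$.

```python
import math


def quantile_from_index(sorted_x, h):
    """Estimate a quantile from a 1-based, real-valued index h.

    sorted_x must already be sorted in increasing order.
    """
    n = len(sorted_x)
    # Assumption: indices outside [1, N] fall back to x_1 or x_N.
    h = min(max(h, 1.0), float(n))
    lo = math.floor(h)
    hi = math.ceil(h)
    x_lo = sorted_x[lo - 1]  # x_{floor(h)}, converted to 0-based indexing
    x_hi = sorted_x[hi - 1]  # x_{ceil(h)}
    # When h is an integer, lo == hi and the result is simply x_h.
    return x_lo + (h - lo) * (x_hi - x_lo)


data = sorted([7.0, 1.0, 4.0, 10.0])
print(quantile_from_index(data, 2.0))  # integer index: the 2nd smallest value
print(quantile_from_index(data, 2.5))  # halfway between the 2nd and 3rd values
```

The methods differ only in how they map $p$ to $h$ and in whether they round or interpolate; the helper above covers the interpolation case.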
Method 2 (R-2): The same as R-1, but with averaging at discontinuities. When $p = 0$, use $x_1$. When $p = 1$, use $x_N$.
Method 3 (R-3, SAS-2): $h = Np$, $Q_p = x_{\lfloor h \rceil}$. The observation numbered closest to $Np$ (a discontinuous, piecewise-constant function of $p$). It is also called the nearest even-order statistic. Here, $\lfloor h \rceil$ indicates rounding to the nearest integer, choosing the even integer in the case of a tie. When $p \leq \frac{\frac{1}{2}}{N}$, use $x_1$.
Method 5 (R-5): $p(k) = \frac{k - \frac{1}{2}}{N}$. Piecewise linear function where the knots are the values midway through the steps of the empirical distribution function. When $p \lt \frac{\frac{1}{2}}{N}$, use $x_1$. When $p \geq \frac{N - \frac{1}{2}}{N}$, use $x_N$.
Method 6 (R-6): $p(k) = \frac{k}{N + 1}$. Linear interpolation of the expectations for the order statistics for the uniform distribution on [0,1]. That is, it is the linear interpolation between the points $(p_h, x_h)$, where $p_h = \frac{h}{N + 1}$ is the probability that the last of $(N + 1)$ randomly drawn values will not exceed the $h^{\text{th}}$ smallest of the first $N$ randomly drawn values. When $p \lt \frac{1}{N + 1}$, use $x_1$. When $p \geq \frac{N}{N + 1}$, use $x_N$.
Method 7 (R-7): $p(k) = \frac{k - 1}{N - 1}$, so that $p(k) = \text{mode}[F(x[k])]$. Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1]. When $p = 1$, use $x_N$.
Method 8 (R-8): $p(k) = \frac{k - \frac{1}{3}}{N + \frac{1}{3}}$, so that $p(k) \approx \text{median}[F(x[k])]$. Linear interpolation of the approximate medians for the order statistics. The resulting quantile estimates are approximately median-unbiased regardless of the distribution of $x$. When $p \lt \frac{\frac{2}{3}}{N + \frac{1}{3}}$, use $x_1$. When $p \geq \frac{N - \frac{1}{3}}{N + \frac{1}{3}}$, use $x_N$.
Method 9 (R-9): $p(k) = \frac{k - \frac{3}{8}}{N + \frac{1}{4}}$ (Blom's plotting position). The resulting quantile estimates are approximately unbiased for the expected order statistics if $x$ is normally distributed. When $p \lt \frac{\frac{5}{8}}{N + \frac{1}{4}}$, use $x_1$. When $p \geq \frac{N - \frac{3}{8}}{N + \frac{1}{4}}$, use $x_N$.
Method 10: (0.4, 0.4). Cunnane's approximately quantile-unbiased definition.
Method 11: Filliben's estimate.
Method 12: (0.35, 0.35). APL, used with PWM.
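The plotting positions listed for methods 5-9 and the number pairs given for methods 10 and 12 can be handled by one routine. The sketch below assumes that each pair is an $(\alpha, \beta)$ pair in the general plotting position $p(k) = \frac{k - \alpha}{N + 1 - \alpha - \beta}$, which is how the SciPy documentation for `mquantiles` describes its `alphap` and `betap` arguments. Solving $p(k) = p$ for $k$ gives the real-valued index $h = \alpha + p(N + 1 - \alpha - \beta)$. The names `PARAMS` and `plotting_position_quantile` are hypothetical, the pairs for methods 5-9 are obtained by matching the formulas above, and method 11 is left out because no parameters are given for it here.

```python
import math
import statistics

# (alpha, beta) pairs. For methods 5-9 they are found by matching
# p(k) = (k - alpha) / (N + 1 - alpha - beta) to the listed formulas.
PARAMS = {
    5: (1 / 2, 1 / 2),  # p(k) = (k - 1/2) / N
    6: (0.0, 0.0),      # p(k) = k / (N + 1)
    7: (1.0, 1.0),      # p(k) = (k - 1) / (N - 1)
    8: (1 / 3, 1 / 3),  # p(k) = (k - 1/3) / (N + 1/3)
    9: (3 / 8, 3 / 8),  # p(k) = (k - 3/8) / (N + 1/4)
    10: (0.4, 0.4),     # Cunnane
    12: (0.35, 0.35),   # APL
}


def plotting_position_quantile(sample, p, alpha, beta):
    """Linear interpolation at the index obtained by solving p(k) = p for k."""
    xs = sorted(sample)
    n = len(xs)
    h = alpha + p * (n + 1 - alpha - beta)
    # Boundary rules: below the first knot use x_1, above the last use x_N.
    h = min(max(h, 1.0), float(n))
    lo, hi = math.floor(h), math.ceil(h)
    return xs[lo - 1] + (h - lo) * (xs[hi - 1] - xs[lo - 1])


sample = [2.0, 4.0, 7.0, 8.0, 11.0, 15.0, 16.0, 20.0]
for method, (alpha, beta) in sorted(PARAMS.items()):
    quartiles = [plotting_position_quantile(sample, p, alpha, beta)
                 for p in (0.25, 0.5, 0.75)]
    print(method, quartiles)

# Cross-check with the standard library: its "exclusive" and "inclusive"
# methods use the same indices as methods 6 and 7 respectively.
print(statistics.quantiles(sample, n=4, method="exclusive"))
print(statistics.quantiles(sample, n=4, method="inclusive"))
```

Clamping $h$ to $[1, N]$ reproduces the "use $x_1$" and "use $x_N$" rules stated for each method, so the boundary cases need no separate branches.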
Software
As shown above, R (version 2.0.0 onwards) implements methods 1-9. SciPy's scipy.stats.mstats.mquantiles implements methods 4-9, method 10, and a method called APL (more information is needed on this one).
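As a rough illustration of how the two Python libraries line up, the sketch below computes quartiles for methods 4-9 with both `numpy.quantile` and `scipy.stats.mstats.mquantiles`. It assumes NumPy 1.22 or later, where `numpy.quantile` accepts the `method` keyword. The sample data and the dictionary names are made up for this example, and the `(alphap, betap)` pairs are the ones listed in the SciPy documentation for these methods.

```python
import numpy as np
from scipy.stats.mstats import mquantiles

sample = np.array([2.0, 4.0, 7.0, 8.0, 11.0, 15.0, 16.0, 20.0])
probs = [0.25, 0.5, 0.75]

# Method number -> name of the corresponding numpy.quantile method.
NUMPY_METHODS = {
    4: "interpolated_inverted_cdf",
    5: "hazen",
    6: "weibull",
    7: "linear",
    8: "median_unbiased",
    9: "normal_unbiased",
}

# Method number -> (alphap, betap) arguments for mquantiles.
MQUANTILES_PARAMS = {
    4: (0.0, 1.0),
    5: (0.5, 0.5),
    6: (0.0, 0.0),
    7: (1.0, 1.0),
    8: (1 / 3, 1 / 3),
    9: (3 / 8, 3 / 8),
}

results = {}
for method, name in NUMPY_METHODS.items():
    alphap, betap = MQUANTILES_PARAMS[method]
    q_numpy = np.quantile(sample, probs, method=name)
    # mquantiles returns a masked array; convert it for easy comparison.
    q_scipy = np.asarray(mquantiles(sample, prob=probs,
                                    alphap=alphap, betap=betap))
    results[method] = (q_numpy, q_scipy)
    print(method, q_numpy, q_scipy)

# Methods 10 and 12 have no numpy.quantile counterpart used here.
print(10, np.asarray(mquantiles(sample, prob=probs, alphap=0.4, betap=0.4)))
print(12, np.asarray(mquantiles(sample, prob=probs, alphap=0.35, betap=0.35)))
```

For methods 4-9 the two libraries should agree on this sample, which makes the comparison a convenient sanity check before comparing either against Minitab.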
Hyndman, Rob J. and Yanan Fan. "Sample Quantiles in Statistical Packages." The American Statistician Vol. 50, No. 4 (Nov. 1996): 361-365. JSTOR 2684934.
McGill, Robert, John W. Tukey, and Wayne A. Larsen. "Variations of Box Plots." The American Statistician Vol. 32, No. 1 (Feb. 1978): 12-16. https://www.jstor.org/stable/2683468.