二次啟航: Quantile

Many measures of an empirical distribution rely on quantiles. According to Wikipedia, quantiles are cut points that divide the distribution of the observations into equal-sized groups. The concept is easy to picture by marking the cut points on the probability distribution of the data. A sample quantile $Q_p$ is a value in the same units as the data that exceeds a proportion $p$ ($0 < p < 1$) of the data; equivalently, it is the $100p$-th percentile of the data. For example, the median can be written as $Q_{0.5}$, the value that exceeds 50% of the data.
Determining quantiles requires the order statistics of the data. One definition, quoted from a treatment of order statistics, reads:
Our goal is to find the value that is the fraction $p$ of the way through the (ordered) data set. We define the rank of the value that we are looking for as $(n-1)p+1$. Note that the rank is a linear function of $p$, and that the rank is $1$ when $p=0$ and $n$ when $p=1$. But of course, the rank will not be an integer in general, so we let $k=\lfloor (n-1)p+1 \rfloor$, the integer part of the desired rank, and we let $t=[(n-1)p+1]-k$, the fractional part of the desired rank. Thus, $(n-1)p+1=k+t$ where $k\in\{1,2,\ldots,n\}$ and $t\in[0,1)$. So, using linear interpolation, we define the sample quantile of order $p$ to be

$$x_{[p]} = x_{(k)} + t\,[x_{(k+1)} - x_{(k)}] = (1-t)\,x_{(k)} + t\,x_{(k+1)}$$
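The interpolation rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `quantile_linear` and the sample data are hypothetical, and the explicit handling of $p=1$ (where $t=0$ and $x_{(k+1)}$ does not exist) is an assumption added so the code does not index past the end of the data.

```python
import math

def quantile_linear(data, p):
    """Sample quantile of order p using rank (n - 1) * p + 1 and linear interpolation."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    x = sorted(data)                 # order statistics x(1) <= ... <= x(n)
    n = len(x)
    if n == 0:
        raise ValueError("data must not be empty")
    rank = (n - 1) * p + 1           # desired (1-based) rank, usually not an integer
    k = math.floor(rank)             # integer part of the rank
    t = rank - k                     # fractional part of the rank, in [0, 1)
    if k >= n:                       # p == 1: t is 0 and x(k+1) does not exist
        return x[-1]
    # x[k - 1] is x(k) because Python lists are 0-based
    return (1 - t) * x[k - 1] + t * x[k]

sample = [3, 1, 4, 1, 5, 9, 2, 6]    # made-up data for illustration
print(quantile_linear(sample, 0.5))  # median: 3.5
print(quantile_linear(sample, 0.25)) # lower quartile: 1.75
```

For this sample the sorted data are 1, 1, 2, 3, 4, 5, 6, 9, so $p=0.5$ gives rank $4.5$, that is $k=4$ and $t=0.5$, and the result lies halfway between $x_{(4)}=3$ and $x_{(5)}=4$.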

However, this definition, listed as R-7 on Wikipedia, is only one of nine common ways to compute sample quantiles, and it is not even the best one. Its widespread use is a legacy of the computational cost of the alternatives in the past (see this article). The same article also discusses the method it regards as the best estimate, listed as R-8 on Wikipedia. R-8 is connected to the Tukey plotting position formula through the CDF, as discussed later.
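To make the link between R-8 and the Tukey plotting position concrete, here is a small sketch. It assumes the R-8 rank formula as given on Wikipedia, $h = (n + 1/3)p + 1/3$, and the Tukey plotting position $p_k = (k - 1/3)/(n + 1/3)$; the function names and the sample data are hypothetical. Solving the rank formula for $p$ at $h = k$ gives exactly the plotting position, so the R-8 quantile at $p_k$ returns the $k$-th order statistic.

```python
import math

def quantile_r8(data, p):
    """Sample quantile using the R-8 rank h = (n + 1/3) * p + 1/3 (formula assumed from Wikipedia)."""
    x = sorted(data)
    n = len(x)
    h = (n + 1 / 3) * p + 1 / 3
    h = min(max(h, 1.0), float(n))   # clamp the rank to [1, n] for extreme p
    k = math.floor(h)
    t = h - k
    if k >= n:                       # rank n: nothing to interpolate towards
        return x[-1]
    return x[k - 1] + t * (x[k] - x[k - 1])

def tukey_position(k, n):
    """Tukey plotting position for the k-th of n order statistics."""
    return (k - 1 / 3) / (n + 1 / 3)

sample = [3, 1, 4, 1, 5, 9, 2, 6]    # made-up data for illustration
ordered = sorted(sample)
for k, value in enumerate(ordered, start=1):
    p_k = tukey_position(k, len(ordered))
    # Up to floating-point rounding, the quantile at p_k is the k-th order statistic
    print(k, round(p_k, 4), value, round(quantile_r8(sample, p_k), 6))
```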
Sometimes we want to compare two distributions: for example, to see whether two empirical distributions share common features, or whether an empirical distribution can be fitted by a theoretical one. Histograms and ECDFs have been widely used for fitting a theoretical distribution, but the result from a histogram depends heavily on the bin width. A quantile-quantile (Q-Q) plot is a more robust way to do the comparison. A Q-Q plot is a scatterplot in which each point pairs a data value with the corresponding estimate of that value derived from the quantile function of the fitted distribution. Note that the quantile function is the inverse of the cumulative distribution function, so the plotting-position methods for the CDF are the inverses of the quantile estimation methods. This article about qq_plot is a very detailed and clear online resource for understanding the basis of CDF matching.
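As a rough sketch of how the points of a Q-Q plot can be computed, the snippet below pairs each sorted data value with the quantile of a fitted normal distribution at its plotting position. Several choices here are assumptions made for illustration only: the normal distribution as the theoretical model, the Tukey plotting position $(k - 1/3)/(n + 1/3)$, the function name `qq_points`, and the sample data. It uses only `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

def qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    x = sorted(data)                          # observed order statistics
    n = len(x)
    fitted = NormalDist.from_samples(x)       # normal fitted by sample mean and stdev
    points = []
    for k, value in enumerate(x, start=1):
        p = (k - 1 / 3) / (n + 1 / 3)         # plotting position: estimated CDF at x(k)
        points.append((fitted.inv_cdf(p), value))  # quantile function = inverse CDF
    return points

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]  # made-up data for illustration
for theoretical, observed in qq_points(sample):
    print(f"{theoretical:8.3f} {observed:6.2f}")
```

If the fitted distribution describes the data well, the pairs fall close to the line where the theoretical and observed values are equal. The pairs can be passed to any plotting library to draw the scatterplot.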


Sunday, October 25, 2015
