二次啟航: October 2015

Wednesday, October 28, 2015

[R] Reset the environment variable for R in El Capitan

The problem: after installing El Capitan, I can no longer call R from the terminal. This is because El Capitan's new security model no longer permits writing to /usr/bin. See this post: R in El Capitan public beta (new security model).

1. Find the R binary.
I have RStudio installed, which gives me an R console. In RStudio, use the following command to find where the R binary (or RHOME) is located.
If you can't locate the directory that way, you can also look for R in this path. My system is El Capitan.

2. Modify the path profile.
Add the directory to your system path. In terminal:
$ sudo nano /etc/paths
Enter your password when prompted.
Go to the bottom of the file and enter the path you found in the first step.
Hit control-x to quit.
Enter “Y” to save the modified buffer.
Hit enter to confirm when prompted with "File name to write".
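To check that the edit took effect, a small script like the one below can help. It is only a sketch: the helper name `dir_on_path` and the placeholder directory are made up for illustration, and it assumes you run it from a newly opened terminal window, since I believe /etc/paths is only read when a new shell session starts.

```python
import os
import shutil


def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


if __name__ == "__main__":
    # Hypothetical placeholder: replace with the directory found in step 1.
    r_dir = "/path/found/in/step/1"
    print("directory on PATH:", dir_on_path(r_dir, os.environ.get("PATH", "")))
    # shutil.which returns the full path of the executable, or None if not found.
    print("R resolves to:", shutil.which("R"))
```

If the second line prints `None`, the shell still cannot find R, and the path added to /etc/paths is worth double-checking.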

Major references:
Add to the PATH on Mac OS X 10.8 Mountain Lion
Set environment variables on Mac OS X Lion

Sunday, October 25, 2015

Quantile

Many measures of an empirical distribution rely on quantiles. According to Wikipedia, quantiles are cut points dividing the distribution of the observations into equal-sized groups. The concept is easy to understand by marking the cut points on the probability distribution of the data. A sample quantile Q_p is a value, in the same units as the data, that exceeds a proportion p (0 < p < 1) of the data; equivalently, it is the (100p)th percentile of the data. For example, the median is Q_0.5, the value that exceeds 50% of the data.
Determining quantiles requires the order statistics of the data. One of the definitions, quoted from order statistics:
Our goal is to find the value that is the fraction $p$ of the way through the (ordered) data set. We define the rank of the value that we are looking for as $(n-1)p+1$. Note that the rank is a linear function of $p$, and that the rank is $1$ when $p=0$ and $n$ when $p=1$. But of course, the rank will not be an integer in general, so we let $k=\lfloor (n-1)p+1 \rfloor$, the integer part of the desired rank, and we let $t=[(n-1)p+1]-k$, the fractional part of the desired rank. Thus, $(n-1)p+1=k+t$ where $k\in\{1,2,\ldots,n\}$ and $t\in[0,1)$. So, using linear interpolation, we define the sample quantile of order $p$ to be

$$x_{[p]} = x_{(k)} + t\,[x_{(k+1)} - x_{(k)}] = (1-t)\,x_{(k)} + t\,x_{(k+1)}$$
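The definition above translates almost line by line into code. Here is a minimal sketch in plain Python; the function name `sample_quantile` is my own, and the special case for k = n (where x(k+1) does not exist but t = 0) is an assumption about how to handle the upper end.

```python
import math


def sample_quantile(data, p):
    """Sample quantile of order p by linear interpolation at rank (n-1)p+1."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(data)            # order statistics x(1) <= ... <= x(n)
    n = len(xs)
    rank = (n - 1) * p + 1       # desired (1-based) rank, usually not an integer
    k = math.floor(rank)         # integer part of the rank
    t = rank - k                 # fractional part of the rank, in [0, 1)
    if k >= n:                   # p = 1: the rank is n and t = 0
        return xs[-1]
    # xs[k - 1] is x(k) and xs[k] is x(k+1) because Python lists are 0-based.
    return (1 - t) * xs[k - 1] + t * xs[k]


if __name__ == "__main__":
    values = [4, 1, 3, 2]
    for p in (0.0, 0.25, 0.5, 1.0):
        print(p, sample_quantile(values, p))
```

For the four values in the example, the median comes out halfway between 2 and 3, as expected from the interpolation.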

However, this definition, listed as R-7 on Wikipedia, is only one of nine ways to compute quantiles, and not even the best one. This is a legacy of the computational load in the past (see this article). The same article also discusses the best estimation method, R-8 on Wikipedia, which is connected to the Tukey plotting position formula through the CDF, as discussed later.
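To make the link between R-8 and the Tukey plotting position concrete, here is a rough sketch. It assumes the R-8 rank is (n + 1/3)p + 1/3 and the Tukey plotting position of the i-th ordered value is (i − 1/3)/(n + 1/3), as I understand them from Wikipedia's table of quantile methods; the function names and the clamping of the rank to [1, n] are my own choices.

```python
import math


def tukey_position(i, n):
    """Tukey plotting position for the i-th order statistic (1-based) of n values."""
    return (i - 1 / 3) / (n + 1 / 3)


def quantile_r8(data, p):
    """Sample quantile using the R-8 rank h = (n + 1/3)p + 1/3 (assumed formula)."""
    xs = sorted(data)
    n = len(xs)
    h = (n + 1 / 3) * p + 1 / 3
    h = min(max(h, 1.0), float(n))   # keep the rank inside [1, n]
    k = math.floor(h)                # integer part of the rank
    t = h - k                        # fractional part of the rank
    if k >= n:
        return xs[-1]
    return xs[k - 1] + t * (xs[k] - xs[k - 1])


if __name__ == "__main__":
    data = [2.0, 3.5, 5.0, 8.0, 13.0]
    n = len(data)
    for i, x in enumerate(sorted(data), start=1):
        p = tukey_position(i, n)
        # Feeding the plotting position back into R-8 should return x (up to rounding).
        print(i, round(p, 4), x, quantile_r8(data, p))
```

Solving h = (n + 1/3)p + 1/3 for p at h = i gives exactly the plotting position, which is why the round trip recovers each data value.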
Sometimes we want to compare two distributions: for example, to see whether two empirical distributions share common features, or whether an empirical distribution can be fitted by a theoretical one. Histograms and ECDFs are widely used for fitting a theoretical distribution, but histogram results depend heavily on the bin width. The quantile-quantile (Q-Q) plot is a more robust way to make the comparison. A Q-Q plot is a scatterplot in which each point pairs a data value with the corresponding estimate of that value from the quantile function of the fitted distribution. Note that the quantile function is the inverse of the cumulative distribution function, so the plotting position methods for the CDF are the inverses of the quantile estimation methods. This article about qq_plot is a very detailed and clear online resource for understanding the basis of CDF matching.
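As an illustration of how the points of a Q-Q plot are built, here is a small sketch using only the Python standard library. It assumes a normal distribution fitted by the sample mean and standard deviation, and the Tukey plotting position (i − 1/3)/(n + 1/3); both are choices made for this example, and `qq_points` is a made-up name.

```python
from statistics import NormalDist, mean, stdev


def qq_points(data):
    """Return (theoretical quantile, data value) pairs for a normal Q-Q plot.

    Needs at least two distinct values so that the fitted normal has sigma > 0.
    """
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))   # assumption: fit a normal distribution
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 1 / 3) / (n + 1 / 3)          # plotting position of the i-th value
        points.append((fitted.inv_cdf(p), x))  # quantile function = inverse CDF
    return points


if __name__ == "__main__":
    for theoretical, observed in qq_points([3.1, 1.2, 2.5, 4.8, 2.9]):
        print(round(theoretical, 3), observed)
```

If the fitted distribution describes the data well, the pairs fall close to the line y = x when plotted.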
