二次啟航: [R] What does the R-squared of the Fixed Effects model summary mean in R?

I used R to fit a Fixed Effects (FE) regression model on panel data. There are three common ways to do this: the base lm function, and the plm and lfe packages. I want to evaluate the R-squared of these FE model outputs, as well as the F-statistics for significance. I found that lm reports one R-sq, plm reports one R-sq, and lfe reports two R-sqs. However, if you look closely, they are all different! What the heck is going on?

According to Stata's definitions, an FE model has three types of R-squared: within, between, and overall. Within R-sq is how much of the variation in the dependent variable within each entity (over time) is captured by the model. Between R-sq is how much of the variation in the dependent variable between entities (that is, in the entity means) is captured by the model. You can see the same structure in the plm package, where you choose the model ("within" or "between") before fitting. The takeaway: we should not look at just one R-squared to judge model performance!
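To make "within" and "between" variation concrete, here is a minimal base R sketch on simulated data. The variable names and the data are made up for illustration and are not from my real dataset. It splits the total sum of squares of y into a within-entity part and a between-entity part.

```r
# Simulated unbalanced panel: four entities with different numbers of rows.
# All names and values here are hypothetical.
set.seed(1)
entity <- factor(rep(c("a", "b", "c", "d"), times = c(5, 7, 6, 8)))
y <- rnorm(length(entity), mean = as.integer(entity))

grand_mean  <- mean(y)
entity_mean <- ave(y, entity)  # every row gets the mean of its own entity

ss_total   <- sum((y - grand_mean)^2)            # all variation in y
ss_within  <- sum((y - entity_mean)^2)           # variation around entity means
ss_between <- sum((entity_mean - grand_mean)^2)  # variation of entity means (one term per row)

print(c(total = ss_total, within = ss_within, between = ss_between))
```

The within and between parts add up to the total, so each type of R-sq is measured against a different share of the variation in y.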

The lfe package instead reports a full and a proj R-sq. I was able to manually calculate and reproduce both of them from a standard lm fit with entity dummies. Based on that, I am quite certain that the full R-sq is the straightforward one: the R-sq computed from all pairs of predicted and original values.

At the same time, lfe's proj R-squared is identical to the so-called within R-squared (in Stata's terms), which is the R-squared that plm reports for a "within" model. A reminder: plm is not just for the FE model, so it only outputs this R-sq when you set model="within", which is another way of saying FE model. If you set "pooling", for example, I think you get ordinary OLS. Therefore, I find lfe more straightforward and prefer it for the FE model. That is not only because its two R-sq definitions are easy to understand, but also because the package is built for FE models, so it is hard to confuse them with random-effects or mixed-effects models (plm is really confusing and its documentation is quite messy).

Below are my own calculations for the full and proj (within) R-sq.

# Fixed Effects via dummy variables: one dummy per entity, no intercept
fe_lm_mod <- lm(y ~ x1 + x2 + entity - 1, data = dataframe)

# Predictions on the original data (NA where a predictor is missing)
y_predict <- predict(fe_lm_mod, newdata = dataframe)
y_original <- dataframe$y

# Indices of rows where both the prediction and the outcome are available
notmiss <- which(!is.na(y_predict) & !is.na(y_original))
y_obs <- y_original[notmiss]
y_hat <- y_predict[notmiss]
entity_obs <- dataframe$entity[notmiss]

# Residual sum of squares
SSres <- sum((y_obs - y_hat)^2)

# Total sum of squares for the full R2: deviations from the grand mean
SStot_full <- sum((y_obs - mean(y_obs))^2)

# Total sum of squares for the within R2: computed on the demeaned outcome,
# i.e. y minus its entity mean. ave() returns the entity mean for every row.
# Reference:
# https://stats.stackexchange.com/questions/262246/difference-of-r2-between-ols-with-individual-dummies-to-panel-fixed-effect-mo
demeaned_y <- y_obs - ave(y_obs, entity_obs)
SStot_within <- sum((demeaned_y - mean(demeaned_y))^2)

print(paste("calculated full R2", 1 - SSres / SStot_full))
print(paste("calculated within R2", 1 - SSres / SStot_within))
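A likely reason the R-sq printed by lm differs from both numbers: when the formula has no intercept (the "- 1"), summary.lm measures total variation around zero instead of around the mean of y. The sketch below checks this on simulated data using base R only. The data frame dat and its columns are assumptions for illustration.

```r
# Simulated balanced panel; all names and values are hypothetical.
set.seed(42)
n_entity <- 10
n_time <- 6
dat <- data.frame(entity = factor(rep(seq_len(n_entity), each = n_time)))
alpha <- rnorm(n_entity, sd = 2)[as.integer(dat$entity)]  # entity effects
dat$x1 <- rnorm(nrow(dat)) + alpha
dat$x2 <- rnorm(nrow(dat))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + alpha + rnorm(nrow(dat))

lsdv_no_int <- lm(y ~ x1 + x2 + entity - 1, data = dat)  # all dummies, no intercept
lsdv_int <- lm(y ~ x1 + x2 + entity, data = dat)         # same fit, with intercept

# Both models have the same fitted values, hence the same residuals
ss_res <- sum(resid(lsdv_no_int)^2)

r2_uncentered <- 1 - ss_res / sum(dat$y^2)                       # total SS around zero
r2_full <- 1 - ss_res / sum((dat$y - mean(dat$y))^2)             # total SS around grand mean
r2_within <- 1 - ss_res / sum((dat$y - ave(dat$y, dat$entity))^2)  # total SS around entity means

print(c(lm_no_intercept = summary(lsdv_no_int)$r.squared,
        uncentered = r2_uncentered,
        lm_with_intercept = summary(lsdv_int)$r.squared,
        full = r2_full,
        within = r2_within))
```

In this sketch, dropping the "- 1" makes the R-sq from summary.lm line up with the hand-computed full R-sq, while the within R-sq still has to be computed separately.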

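After briefly reading page 10 of the Stata manual, I think the full R-sq in lfe and the overall R-sq in Stata are similar in spirit: both are computed directly from pairs of predicted y and original y. One caveat: as far as I can tell, Stata's overall R-sq for xtreg, fe is the squared correlation between y and the x-part of the prediction only, without the estimated entity effects, while lfe's full R-sq includes the entity effects, so the two numbers need not match. I have seen some people say that overall R-sq is a weighted average of within and between R-sq, but I did not find any supporting evidence for this statement.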

For between R-sq, I think the plm package with model="between" may produce it, but I am not sure. One can try to calculate it from the definitions in the Stata manual, as I did for the full and within R-sq above.
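Here is a rough sketch of two candidate "between" R-sq calculations in base R, on simulated data. It assumes (a) that Stata's between R-sq is the squared correlation between the entity means of y and the entity means of x multiplied by the FE slopes, and (b) that a between model is a regression of entity means on entity means. I have not checked either number against real Stata or plm output, and all names below are hypothetical.

```r
# Simulated balanced panel; all names and values are hypothetical.
set.seed(7)
n_entity <- 12
n_time <- 5
entity <- factor(rep(seq_len(n_entity), each = n_time))
alpha <- rnorm(n_entity)[as.integer(entity)]  # entity effects
x1 <- rnorm(length(entity)) + alpha
x2 <- rnorm(length(entity))
y <- 0.5 * x1 - 0.3 * x2 + alpha + rnorm(length(entity))
dat <- data.frame(entity, x1, x2, y)

# FE slopes from the dummy-variable regression
fe_fit <- lm(y ~ x1 + x2 + entity, data = dat)
beta <- coef(fe_fit)[c("x1", "x2")]

# Collapse the panel to one row of means per entity
means <- aggregate(cbind(y, x1, x2) ~ entity, data = dat, FUN = mean)

# (a) squared correlation of entity-mean y with entity-mean x times the FE slopes
xb_bar <- drop(as.matrix(means[, c("x1", "x2")]) %*% beta)
r2_between_fe <- cor(xb_bar, means$y)^2

# (b) R-sq of a regression of entity means on entity means
between_fit <- lm(y ~ x1 + x2, data = means)
r2_between_be <- summary(between_fit)$r.squared

print(c(between_from_fe_slopes = r2_between_fe,
        between_regression = r2_between_be))
```

The two numbers generally differ, because (a) reuses the FE slopes while (b) re-estimates the slopes on the entity means, so it matters which definition a package follows.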

In terms of the coefficients and their significance, I think plm uses lm's coefficients and statistics anyway. I also checked lfe's coefficients, and all of them are the same. It doesn't matter which package you use, but I recommend lfe for the reasons above.
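The agreement in coefficients can be checked without any panel package. The sketch below, on simulated data with made-up names, compares the dummy-variable lm fit with an lm fit on entity-demeaned variables, which is my understanding of what the "within" transformation does. The slopes match; the standard errors match only after a degrees-of-freedom correction, because the demeaned regression does not know that the entity means were estimated.

```r
# Simulated balanced panel; all names and values are hypothetical.
set.seed(123)
n_entity <- 8
n_time <- 10
dat <- data.frame(entity = factor(rep(seq_len(n_entity), each = n_time)))
alpha <- rnorm(n_entity)[as.integer(dat$entity)]  # entity effects
dat$x1 <- rnorm(nrow(dat)) + alpha
dat$x2 <- rnorm(nrow(dat))
dat$y <- 0.5 * dat$x1 - 0.3 * dat$x2 + alpha + rnorm(nrow(dat))

# Dummy-variable (LSDV) fit
lsdv_fit <- lm(y ~ x1 + x2 + entity, data = dat)

# Within fit: subtract entity means from every variable, no intercept
dm <- function(v) v - ave(v, dat$entity)
within_fit <- lm(dm(y) ~ dm(x1) + dm(x2) - 1, data = dat)

coef_lsdv <- unname(coef(lsdv_fit)[c("x1", "x2")])
coef_within <- unname(coef(within_fit))

# The naive within fit overstates the residual degrees of freedom; rescale its SEs
se_lsdv <- unname(summary(lsdv_fit)$coefficients[c("x1", "x2"), "Std. Error"])
se_within_raw <- unname(summary(within_fit)$coefficients[, "Std. Error"])
se_within_fixed <- se_within_raw *
  sqrt(df.residual(within_fit) / df.residual(lsdv_fit))

print(rbind(coef_lsdv, coef_within, se_lsdv, se_within_fixed))
```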

So far I have made this summary of the R-sq outputs (to be continued):

1. lm R-sq: not good for the Fixed Effects model; I could not reproduce it. Note that with "- 1" in the formula, lm measures total variation around zero rather than around the mean of y.
2. lfe "full" R-sq: R-sq over all pairs of predicted y and original y; similar in spirit to the "overall" R-sq.
3. lfe "proj" R-sq: the "within" R-sq, i.e. how much of the variation in the dependent variable within each entity is captured by the model.
4. plm model="within" R-sq: same as 3.
5. plm model="between" R-sq: the "between" R-sq, i.e. how much of the variation in the dependent variable between entities is captured by the model.
6. plm model="pooling" R-sq: not good for the Fixed Effects model. This is the standard OLS R-sq, not a Fixed Effects R-sq.

lm and lfe are easy to use for the FE model! If the "between" R-sq is not required, you don't need plm.

Monday, September 27, 2021
