
Recently, R.J. Rummel looked at the regression model

y = β0 + β1·x1 + β2·x2 + u

and concluded that if x1 and x2 are highly correlated, the estimate of one coefficient, say β1, could be large at the expense of β2. He actually spoke of "stealing".

Bryan Caplan replied:
People who use statistics often talk as if multicollinearity (high correlations between [explanatory] variables) biases results. But it doesn't. Multicollinearity leads to big standard errors, but if your [explanatory] variables are highly correlated, they SHOULD be big! Intuitively, big standard errors mean that the effects of different variables are highly uncertain, and if your [explanatory] variables are highly correlated, highly uncertain is what you should be.
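Caplan's claim is easy to check with a quick Monte Carlo simulation (a sketch in Python/numpy; the data-generating process, the true coefficients of 0.5, and the sample sizes are illustrative assumptions, not from the post): the estimates stay unbiased, but their sampling spread grows with the correlation between the regressors.

```python
import numpy as np

# Monte Carlo check of Caplan's claim: correlated regressors do not bias
# the OLS estimates, they only make them noisier.
# Assumed data-generating process (illustration only):
#   y = 1 + 0.5*x1 + 0.5*x2 + e,  e ~ N(0, 1)
rng = np.random.default_rng(0)
n, reps = 100, 2000

results = {}
for r in (0.0, 0.9):          # correlation between x1 and x2
    b1 = np.empty(reps)
    for i in range(reps):
        x1 = rng.standard_normal(n)
        # construct x2 with correlation r to x1
        x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)
        y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.standard_normal(n)
        X = np.column_stack([np.ones(n), x1, x2])
        # OLS via least squares; index 1 is the coefficient on x1
        b1[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    results[r] = (b1.mean(), b1.std())
    print(f"r12={r}: mean of b1 = {b1.mean():.3f}, sd of b1 = {b1.std():.3f}")
```

With r12 = 0.9 the spread of b1 is roughly 1/sqrt(1 − 0.9²) ≈ 2.3 times the uncorrelated case, while its mean stays at the true value of 0.5: big standard errors, no bias.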
To put it more formally: if the true model contains two explanatory variables and a constant, the variance of a slope coefficient looks as follows:

Var(b1) = σ² / [ (1 − r12²) · Σ (x1i − x̄1)² ]

One can easily see that the coefficient's variance is increasing in the squared correlation coefficient between the two explanatory variables (r12). This is why correlated regressors can make all coefficients insignificantly different from zero (tested one at a time). But what's more interesting is that a positive correlation between the regressors leads to a negative correlation between the estimated coefficients (which is why the complete set of regressors can be jointly significant although individual regressors are insignificantly different from zero):

Corr(b1, b2) = −r12

Maybe that's what Rummel meant by "stealing". Rummel then suggests running a bivariate regression for each explanatory variable. What is somewhat forgotten in the whole discussion is that correlation among the regressors is actually the only reason why we run a multiple regression in the first place. We care about net effects! When the true model contains two explanatory variables but we include only one of them, the resulting slope is biased (see my story about the omitted variable bias). A valid critique would be that the two variables in our regression are actually generated by a third variable which we do not observe, i.e. they are proxies for the same thing. In that case our model (with two regressors) is wrong/misspecified. The only solution to this problem is 1. to use more data and 2. to think harder about how our world works.
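Both points, the negative correlation between the estimated coefficients and the bias of Rummel's bivariate regressions, can be illustrated with a short simulation (again Python/numpy; the data-generating process is an assumption for illustration):

```python
import numpy as np

# Assumed data-generating process (illustration only):
#   y = 1 + 0.5*x1 + 0.5*x2 + e,  corr(x1, x2) = 0.8
# Across repeated samples, the multiple-regression estimates b1 and b2 are
# negatively correlated, while the bivariate ("short") regression of y on
# x1 alone is biased away from the net effect of 0.5.
rng = np.random.default_rng(1)
n, reps, r = 200, 2000, 0.8

b1_multi = np.empty(reps)
b2_multi = np.empty(reps)
b1_short = np.empty(reps)
for i in range(reps):
    x1 = rng.standard_normal(n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.standard_normal(n)
    # multiple regression: net effects of x1 and x2
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    b1_multi[i], b2_multi[i] = b[1], b[2]
    # bivariate regression of y on x1 alone (Rummel's suggestion)
    b1_short[i] = np.polyfit(x1, y, 1)[0]

print(np.corrcoef(b1_multi, b2_multi)[0, 1])  # close to -r12 = -0.8
print(b1_multi.mean())                        # close to the net effect 0.5
print(b1_short.mean())                        # close to 0.5 + 0.8*0.5 = 0.9
```

The bivariate slope converges to β1 + β2·r12·(s2/s1), the standard omitted variable bias formula (here s1 = s2, so it lands near 0.9), which is exactly why "just run bivariate regressions" does not answer the question about net effects.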
Paul N (guest) commented on 27 Sep, 08:18:
Good post. By the way, I wish there were a job where I could do "2." all day long and get paid for it.