Heckman’s two-step procedure appears often in the accounting literature. But how do we run it in Stata?
The two steps refer to the following two regressions:
Outcome equation: y = X × b1 + u1
Selection equation: Dummy = Z × b2 + u2
The selection equation must contain at least one variable that is not in the outcome equation.
The selection equation must be estimated using probit. An intuitive way to do Heckman’s two steps is to estimate the selection equation first, then include the inverse Mills ratio (IMR) derived from the selection equation in the outcome equation. In other words, run two regressions, one after the other.
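To see why the IMR belongs in the outcome equation, here is the standard textbook result written in the notation above. It is a sketch under assumptions that the post does not state: (u1, u2) are jointly normal, u2 has unit variance, rho is their correlation, and sigma_1 is the standard deviation of u1 (rho and sigma_1 are symbols introduced here).

```latex
\begin{aligned}
\text{Dummy} &= \mathbf{1}\{\, Z b_2 + u_2 > 0 \,\}, \\
E[\, y \mid X, Z, \text{Dummy} = 1 \,] &= X b_1 + \rho\,\sigma_1\,\lambda(Z b_2), \\
\lambda(c) &= \frac{\phi(c)}{\Phi(c)} .
\end{aligned}
```

Here phi and Phi are the standard normal density and CDF, so lambda(Z b2) is the IMR, and its coefficient in the second regression estimates rho times sigma_1.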
Stata command for the selection equation:
probit Dummy Z
(using both observations that are selected into the sample and observations that are not selected into the sample, i.e., Dummy = 1 or Dummy = 0)
Note that the vce option (i.e., standard, robust, or clustered standard errors, among others) will not change the resulting IMR.
Next, calculate IMR immediately:
predict probitxb, xb
ge pdf = normalden(probitxb)
ge cdf = normal(probitxb)
ge imr = pdf/cdf
Finally, include imr in the outcome equation:
reg y X imr, vce(specified_vcetype)
(using observations that are selected into the sample only)
Note that the first and second regressions use different numbers of observations.
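For a concrete run of the steps above, here is a minimal sketch on womenwk, an example dataset that ships with Stata (my choice for illustration, not data from this post). In that dataset wage is missing for women who do not work. The names selected, probitxb, and imr are hypothetical; married and children play the role of the extra variables in Z.

```stata
* Example data (an assumption for illustration): wage is missing for non-workers
webuse womenwk, clear

* Selection dummy: 1 if the outcome is observed, 0 otherwise
generate byte selected = !missing(wage)

* Step 1: probit on ALL observations, using the selection variables Z
probit selected married children educ age

* Linear index Z*b2, stored in double precision
predict double probitxb, xb

* IMR = normal density / normal CDF, evaluated at the index
generate double imr = normalden(probitxb) / normal(probitxb)

* Step 2: OLS on the selected observations only, with the IMR added
regress wage educ age imr if selected == 1
```

One caveat on this sketch: the plain OLS standard errors in the last step treat imr as if it were observed data rather than an estimate from the probit, so they are best read with caution.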
However, this is not the end of the story. I find that the first probit regression sometimes produces missing IMRs. For example, even if I have 100 observations with all the data the probit requires, I may only get an IMR for 60 observations using this step-by-step method. I have not figured out why.
I then noticed that Stata in fact provides an all-in-one command, heckman, that estimates both the selection equation and the outcome equation:
heckman y X, select(Dummy = Z) twostep first mills(imr) vce(specified_vcetype)
I recommend using the twostep option of the heckman command. This option produces the same coefficient estimates as the step-by-step method, although the standard errors can differ because heckman adjusts them for the estimated IMR. However, this option may reduce the number of available vce types. In addition, the specified vce option only applies to the outcome equation and has no effect on the selection equation.
In this all-in-one method, we must pool the observations that are selected into the sample with those that are not: Dummy is 1 or 0 for all observations, and y and X are missing for observations that are not selected into the sample. A benefit of this all-in-one method is that the weird missing-IMR issue does not appear.
I did take a closer look at the missing IMRs from the step-by-step method. They all have an extremely small value in the all-in-one method. I also find that the step-by-step method has greater flexibility. Thus, if we want to use the step-by-step method but encounter the weird missing-IMR issue, it seems safe to just set the missing IMRs to zero.
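Before replacing anything with zero, it may help to put the two versions of the IMR side by side. The sketch below does this on Stata's womenwk example dataset (again my choice for illustration); imr_heck, imr_manual, imr_gap, probitxb, and selected are hypothetical names, and you would swap in your own variables.

```stata
webuse womenwk, clear
generate byte selected = !missing(wage)

* All-in-one: keep heckman's own IMR in imr_heck
heckman wage educ age, select(selected = married children educ age) twostep mills(imr_heck)

* Step-by-step IMR from a separate probit
probit selected married children educ age
predict double probitxb, xb
generate double imr_manual = normalden(probitxb) / normal(probitxb)

* How far apart are the two versions where both exist?
generate double imr_gap = abs(imr_manual - imr_heck)
summarize imr_gap

* Which observations lose their IMR only in the step-by-step version?
count if missing(imr_manual) & !missing(imr_heck)
list probitxb imr_heck if missing(imr_manual) & !missing(imr_heck)
```

If the last command lists any observations, their probitxb and imr_heck values show whether the all-in-one IMR really is close to zero for them, which is the condition under which setting the missing values to zero looks harmless.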
Any comments are welcome.
Dear Kai Chen
Have you run a Heckman two-step for survey data? The heckman two-step command does not allow iweights or pweights, and svy: heckman is not allowed.
Tks,
Dear Kai Chen,
Do you have a suggestion for how to export heckit results in two separate tables?
(Using outreg2, the parameters for the final stage and the selection equation are reported side by side in columns.)
I would also be excited for suggestions using export commands other than -outreg2-.
Many thanks in advance!
Kind Regards
Tina
Is it possible to use the same explanatory variable with the Heckman two-stage(probit & OLS) model at the same time?
Yes, but keep it to a minimum. The set of explanatory variables in the selection equation (Z) needs to be a superset of the set of explanatory variables in the outcome equation (X). The more alike the two sets are, the more the inverse Mills ratio will be correlated with X, and thus the larger the standard errors will be.
Hi Gijs,
Why does Z need to be a superset of X? What is violated if there is a variable in X that is not in Z – does this mess up the inverse Mills ratio? It seems that some important predictors of the outcome might only be measurable for people who select into treatment, so it would not be possible to include these variables in the selection Probit. Or in this scenario, would a different model be appropriate? Thanks.
Hello dears
I am having difficulty analyzing data using Heckman’s two-step model. From my reading, I understand that the selection model should have at least one explanatory variable that is not in the outcome (conditional) model. But in my data, the outcome model has one more variable that is not in the selection model, and it would have no meaning if it were included in the selection model. This goes beyond what I learned from my reading. So please help me address this problem. Can I run the analysis in this situation or not?
The other question is how can I know if there is a correlation between the error terms of the two equations and how can I know if there is sample selection bias or not ?
Thank you!!!
How can we interpret IMR result from the data analysed?
Dear Kai Chen,
I also encountered the missing IMR issue when using the step-by-step method. I checked all the missing IMRs, and found that they are either extremely large values (larger than 6), or extremely small values (smaller than 0.0000000001) in the all-in-one method.
So I am not sure if it is safe to set all missing IMRs as zero.
I feel it is caused by the “predict” command that follows the probit regression, as the predicted outcome variable has exactly the same missing-value issue as the IMR.
But I do not know how to fix this issue. Hope my findings can inspire you a little bit.
Best,
Sichen
Good day,
I am doing a study with three outcome variables, one of which is also the selection variable (current smoking status, to be specific). The other outcome variables are age at smoking initiation and smoking other tobacco products. Now the question is: do I run the heck probit for each outcome variable separately? That is, do I repeat all the steps three times (OLS three times, IMR three times, etc.)?