Heckman’s two-step procedure appears often in the accounting literature. But how do we run it in Stata?
The two steps refer to the following two regressions:
Outcome equation: y = X × b1 + u1
Selection equation: Dummy = 1 if Z × b2 + u2 > 0, and Dummy = 0 otherwise
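The selection equation should contain at least one variable that is not in the outcome equation (the exclusion restriction). Without it, the model is identified only through the nonlinearity of the inverse Mills ratio.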
The selection equation must be estimated using probit. An intuitive way to do Heckman’s two steps is to estimate the selection equation first, and then include the inverse Mills ratio (IMR) derived from the selection equation as an additional regressor in the outcome equation. In other words, run two regressions, one after the other.
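For reference, the textbook form of this correction can be written as below. This is a sketch under the standard assumptions, which the post does not spell out: Dummy = 1 exactly when Z × b2 + u2 > 0, u1 and u2 are jointly normal, and Var(u2) = 1. The symbols ρ (correlation between u1 and u2) and σ1 (standard deviation of u1) are introduced here only for illustration.

```latex
\lambda(Z b_2) = \frac{\phi(Z b_2)}{\Phi(Z b_2)}, \qquad
E[\, y \mid X, Z, \text{Dummy} = 1 \,] = X b_1 + \rho\,\sigma_1\,\lambda(Z b_2)
```

Here φ and Φ are the standard normal density and CDF, so the coefficient on the IMR in the second regression estimates ρσ1.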
Stata command for the selection equation:
probit Dummy Z
(using both the observations that are selected into the sample and those that are not, i.e., Dummy = 1 or Dummy = 0)
Note that the vce option (i.e., standard, robust, or clustered standard errors, among others) does not change the resulting IMR.
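Next, calculate the IMR immediately after the probit:
* linear prediction Z × b2 from the probit
predict probitxb, xb
* IMR = standard normal density / standard normal CDF at the linear prediction
ge pdf = normalden(probitxb)
ge cdf = normal(probitxb)
ge imr = pdf/cdf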
Finally, include imr in the outcome equation:
reg y X imr if Dummy == 1, vce(specified_vcetype)
(using only the observations that are selected into the sample)
Note that the first and second regressions use different numbers of observations.
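The step-by-step procedure can be tried end to end on simulated data. The following is a minimal sketch, not the code behind any result in this post: the variable names (x, z, dummy, u1, u2) and all parameter values are made up for illustration, and vce(robust) is just one possible choice of vce type.

```stata
* Simulated data; every name and number here is hypothetical
clear
set seed 12345
set obs 2000
generate x = rnormal()
* z enters the selection equation only (the excluded variable)
generate z = rnormal()
generate u2 = rnormal()
* correlated errors are what create the selection problem
generate u1 = 0.5*u2 + rnormal()
generate byte dummy = (0.5 + x + z + u2 > 0)
* y is observed only for selected observations
generate y = 1 + 2*x + u1 if dummy == 1

* Step 1: probit on all observations, then build the IMR
probit dummy x z
predict double xb, xb
generate double imr = normalden(xb) / normal(xb)

* Step 2: OLS on the selected observations with the IMR added
regress y x imr if dummy == 1, vce(robust)
```

Step 1 uses all 2,000 simulated observations, while step 2 uses only those with dummy equal to 1, which is the difference in sample size noted above.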
However, this is not the end of the story. I find that the first probit regression sometimes leaves the IMR missing. For example, even if I have 100 observations with the required Dummy and Z data, I may get an IMR for only 60 of them using this step-by-step method. I have not figured out why.
I then noticed that Stata in fact provides an all-in-one command, heckman, that estimates both the selection equation and the outcome equation:
heckman y X, select(Dummy = Z) twostep first mills(imr) vce(specified_vcetype)
I recommend using the twostep option of the heckman command. This option produces the same coefficient estimates as the step-by-step method, and its standard errors account for the fact that the IMR is itself estimated. However, twostep reduces the number of available vce types (conventional, bootstrap, and jackknife only). In addition, the specified vce option applies only to the outcome equation and has no effect on the selection equation.
In this all-in-one method, we must pool the observations that are selected into the sample with those that are not: Dummy is 1 or 0 for all observations, and y and X may be missing for observations that are not selected into the sample. A benefit of this all-in-one method is that the weird missing-IMR issue does not appear.
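To see the pooled data layout and the all-in-one command next to the manual steps, here is a minimal sketch on simulated data. All variable names (x, z, dummy, imr_manual, imr_heckman, imr_diff) and parameter values are made up for illustration; mills() stores the IMR that heckman computes so that it can be compared with a hand-built one.

```stata
* Simulated pooled data; every name and number here is hypothetical
clear
set seed 2024
set obs 2000
generate x = rnormal()
generate z = rnormal()
generate u2 = rnormal()
generate u1 = 0.5*u2 + rnormal()
generate byte dummy = (0.5 + x + z + u2 > 0)
* y is left missing for observations that are not selected
generate y = 1 + 2*x + u1 if dummy == 1

* All-in-one: selection and outcome equations in a single command
heckman y x, select(dummy = x z) twostep first mills(imr_heckman)

* Manual version of the IMR for comparison
quietly probit dummy x z
predict double xb, xb
generate double imr_manual = normalden(xb) / normal(xb)

* The two IMR variables should agree up to numerical precision
generate double imr_diff = abs(imr_manual - imr_heckman)
summarize imr_diff
```

Running the same comparison on real data is a quick way to spot observations where the manual IMR is missing but the heckman one is not.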
I did take a closer look at the observations whose IMR is missing under the step-by-step method. They all have an extremely small IMR under the all-in-one method. I also find that the step-by-step method is more flexible, for example in the choice of vce type. Thus, if we want to use the step-by-step method but run into the weird missing-IMR issue, it seems safe to just set the missing IMR values to zero.
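One way to investigate the missing values, sketched here on simulated data, is to check whether the affected observations were excluded from the probit estimation sample. The scenario built into the sketch (a regressor that predicts selection perfectly, so that probit drops those observations) is only a guess at one possible cause and has not been verified against the data behind this post; the names d, in_probit, and imr_missing are hypothetical.

```stata
* Simulated data with a hypothetical perfect predictor d
clear
set seed 99
set obs 2000
generate x = rnormal()
generate z = rnormal()
generate byte d = (runiform() < 0.05)
* every observation with d == 1 is selected by construction
generate byte dummy = (0.5 + x + z + rnormal() > 0) | d == 1

probit dummy x z d
* flag the observations that probit actually used
generate byte in_probit = e(sample)

predict double xb, xb
generate double imr = normalden(xb) / normal(xb)

* cross-tabulate missing IMR against membership in the estimation sample
generate byte imr_missing = missing(imr)
tabulate imr_missing in_probit
```

If the missing values all fall outside the estimation sample, the notes printed above the probit output and the rules and asif options of predict (documented under probit postestimation) are the places to look next.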
Comments are welcome.