Stata commands to do Heckman two steps

Heckman’s two-step procedure appears often in the accounting literature. But how do we run it in Stata?

The two steps refer to the following two regressions:

Outcome equation: y = X × b1 + u1
Selection equation: Dummy = Z × b2 + u2

The selection equation must contain at least one variable that is not in the outcome equation.

The selection equation must be estimated using Probit. An intuitive way to do Heckman’s two steps is to estimate the selection equation first, then include the inverse Mills ratio (IMR) derived from the selection equation in the outcome equation. In other words, run two regressions, one after the other.

Stata command for the selection equation:

probit Dummy Z (using both observations that are selected into the sample and observations that are not selected into the sample, i.e., Dummy = 1 or Dummy = 0)

Note that the vce option (i.e., standard, robust or clustered standard errors, among others) does not change the resulting IMR.

Next, calculate IMR immediately:

predict probitxb, xb
ge pdf = normalden(probitxb)
ge cdf = normal(probitxb)
ge imr = pdf/cdf

Finally, include imr in the outcome equation:

reg y X imr if Dummy == 1, vce(specified_vcetype) (using only observations that are selected into the sample)

Note that the first and the second regressions use different numbers of observations.
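As a concrete illustration, the step-by-step method can be sketched on Stata's womenwk example dataset. The variable names wage, educ, age, married and children belong to that dataset, not to this post. Here married and children play the role of the variables that appear only in the selection equation, and vce(robust) is just one possible choice of vce type.

```stata
* Sketch only: womenwk stands in for your own data
webuse womenwk, clear

* Selection dummy: 1 if wage is observed, 0 otherwise
generate byte Dummy = !missing(wage)

* Step 1: probit selection equation on ALL observations (Dummy = 1 or 0)
probit Dummy married children educ age

* Inverse Mills ratio from the probit linear prediction
predict double probitxb, xb
generate double imr = normalden(probitxb) / normal(probitxb)

* Step 2: outcome equation on the selected observations only
regress wage educ age imr if Dummy == 1, vce(robust)
```

Keep in mind that the second regression treats imr as if it were observed data rather than an estimate from the first step.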

However, that is not the whole story. I find that the step-by-step method sometimes produces missing IMR values after the first Probit regression. For example, even if I have 100 observations with the required Dummy and Z data, I may get an IMR for only 60 observations. I have not figured out why.
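One way to narrow down where missing values come from is to check which intermediate quantity goes missing first. The following is a minimal diagnostic sketch, not an explanation of the issue. It again assumes the womenwk example dataset as a stand-in, and storing the intermediate variables as double is an assumption on my part, not a confirmed fix.

```stata
* Sketch only: womenwk stands in for your own data
webuse womenwk, clear
generate byte Dummy = !missing(wage)

probit Dummy married children educ age
predict double probitxb, xb
generate double pdf = normalden(probitxb)
generate double cdf = normal(probitxb)
generate double imr = pdf / cdf

* Observations with no linear prediction (some selection variable is missing)
count if missing(probitxb)

* Observations where xb exists but the ratio could not be computed
count if missing(imr) & !missing(probitxb)

* Range of xb and cdf among those observations, if there are any
summarize probitxb cdf if missing(imr) & !missing(probitxb)
```

If the second count is nonzero, the summarize output shows whether those observations sit at extreme values of the linear prediction.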

I then noticed that Stata in fact provides an all-in-one method that estimates both the selection equation and the outcome equation in a single command, heckman:

heckman y X, select(Dummy = Z) twostep first mills(imr) vce(specified_vcetype)

I recommend using the twostep option of the heckman command. This option produces the same coefficient estimates as the step-by-step method. The standard errors will generally differ, because heckman adjusts them for the fact that the IMR is itself estimated. The twostep option also reduces the number of available vce types (only conventional, bootstrap and jackknife are allowed). In addition, the specified vce option only applies to the outcome equation and has no effect on the selection equation.

In this all-in-one method, we must pool the observations that are selected into the sample with those that are not: Dummy is 1 or 0 for every observation, and y and X are typically missing for observations that are not selected. A benefit of this all-in-one method is that the weird missing-IMR issue does not appear.
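The all-in-one command can be checked against a manually computed IMR. The sketch below again assumes the womenwk example dataset as a stand-in for the pooled data. The names imr_heck and imr_manual are made up for illustration.

```stata
* Sketch only: womenwk stands in for your own pooled data
webuse womenwk, clear
generate byte Dummy = !missing(wage)

* All-in-one: wage is missing when Dummy == 0; mills() saves the IMR
heckman wage educ age, select(Dummy = married children educ age) ///
    twostep first mills(imr_heck)

* Manual IMR from a separate probit, for comparison
probit Dummy married children educ age
predict double probitxb, xb
generate double imr_manual = normalden(probitxb) / normal(probitxb)

* Compare the two versions observation by observation
compare imr_heck imr_manual
summarize imr_heck imr_manual
```

Any differences reported by compare should be tiny if the two methods compute the same quantity. Larger gaps or unequal counts of nonmissing values would point to where the two methods diverge.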

I took a closer look at the observations with a missing IMR in the step-by-step method. They all have an extremely small IMR in the all-in-one method. The step-by-step method offers greater flexibility, so if we want to use it but run into the weird missing-IMR issue, it seems safe to just set the missing IMR to zero.

Any comment is welcome.


5 Responses to Stata commands to do Heckman two steps

  1. Julio Galárraga says:

    Dear Kai Chen

    Have you run a Heckman two-step for survey data? The heckman command with the twostep option does not allow iweights or pweights, and svy: heckman is not allowed.

    Tks,

  2. Tina says:

    Dear Kai Chen,

    Do you have a suggestion for how to export heckit results in two separate tables?
    (Using outreg2, the parameters for the final stage and the selection equation are reported side by side in columns.)
    I would also be glad to hear suggestions using export commands other than -outreg2-.

    Many thanks in advance!
    Kind Regards
    Tina

  3. Getalem Alemu says:

    Is it possible to use the same explanatory variable in both stages (probit and OLS) of the Heckman two-stage model?

    • Gijs says:

      Yes, but keep it to a minimum. The set of explanatory variables in the selection equation (Z) needs to be a superset of the set of explanatory variables in the outcome equation (X). The more alike the two sets are, the more the inverse Mills ratio will be correlated with X, and thus the larger the standard errors will be.

  4. amare wodaju says:

    Hello,
    I am having difficulty analyzing data with Heckman’s two-step model. From my reading, I understand that the selection model should have at least one explanatory variable that is not in the outcome (conditional) model. But in my data, the outcome model has one more variable that is not in the selection model and would have no meaning there, which is a situation my reading did not cover. Please help me with this problem: can I run the analysis in this situation or not?
    My other question: how can I know whether the error terms of the two equations are correlated, and whether there is sample selection bias?
    Thank you!
