Stata commands to run Heckman's two-step estimation

We often see Heckman's two-step procedure in the accounting literature. But how do we implement it in Stata?

The two steps refer to the following two regressions:

Outcome equation: y = X × b1 + u1
Selection equation: Dummy = Z × b2 + u2

The selection equation must contain at least one variable that is not in the outcome equation (an exclusion restriction).

The selection equation must be estimated using probit. An intuitive way to run Heckman's two steps is to estimate the selection equation first and then include the inverse Mills ratio (IMR) derived from the selection equation in the outcome equation. In other words, run two regressions, one after the other.
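For selected observations (Dummy = 1), the IMR equals pdf(Z × b2) / cdf(Z × b2), i.e., the standard normal density divided by the standard normal cumulative distribution, both evaluated at the fitted index Z × b2. This is exactly what the normalden() and normal() lines below compute.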

Stata command for the selection equation, estimated using both observations that are selected into the sample (Dummy = 1) and observations that are not (Dummy = 0):

probit Dummy Z

Note that the vce() option (standard, robust, or clustered standard errors, among others) will not change the resulting IMR.

Next, calculate the IMR immediately (predict uses the most recently estimated model):

* linear prediction Z*b2 from the probit
predict probitxb, xb
* standard normal density and CDF evaluated at the linear prediction
gen pdf = normalden(probitxb)
gen cdf = normal(probitxb)
* inverse Mills ratio for selected observations
gen imr = pdf/cdf

Finally, include imr in the outcome equation, estimated using only the observations selected into the sample (Dummy = 1):

regress y X imr, vce(vcetype)

Note that the first and the second regressions use different numbers of observations.

However, this is not the end of the story. I find that the first probit regression sometimes produces missing IMRs. For example, even if I have 100 observations with the required Dummy and Z data, I may get IMRs for only 60 observations using this step-by-step method. I have not figured out why.

I then note that Stata in fact provides an all-in-one command, heckman, that estimates both the selection equation and the outcome equation:

heckman y X, select(Dummy = Z) twostep first mills(imr) vce(vcetype)

I recommend using the twostep option of the heckman command. This option produces the same results as the step-by-step method, but it may reduce the number of available vce types. In addition, the specified vce() option applies only to the outcome equation and has no effect on the selection equation.

In this all-in-one method, we must pool together the observations that are selected into the sample and the observations that are not, with Dummy equal to 1 or 0 for all observations and y and X missing for the unselected observations. A benefit of the all-in-one method is that the weird missing-IMR issue does not appear.

I did take a closer look at the IMRs that are missing under the step-by-step method: they all take extremely small values in the all-in-one method. Since the step-by-step method offers greater flexibility, if we want to use it but encounter the weird missing-IMR issue, it seems safe to simply set the missing IMRs to zero.
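In Stata, that fix is a one-liner, run after imr has been generated as above:

replace imr = 0 if missing(imr)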

Any comments are welcome.


The calculation of average credit rating using ratings from three rating agencies

I was doing something in finance and wanted to calculate the average rounded credit rating. Basically, I needed to translate textual grades (e.g., AAA, Baa) into numerical values. I found a clue in the following paper:

Becker, B., and T. Milbourn. 2011. How did increased competition affect credit ratings? Journal of Financial Economics 101 (3):493-514.

See their Table 2 for an overview of the rating levels for the three main rating agencies and the numerical value assignments used in their empirical work.
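As a minimal sketch of the translation in Stata, suppose the three agencies' letter ratings are stored in string variables sp_rating, moody_rating, and fitch_rating (hypothetical names), and the numerical scale runs 1 = AAA/Aaa, 2 = AA+/Aa1, and so on (an illustrative scale; use the exact assignments from their Table 2):

* map letter grades to numbers (build moody_num and fitch_num analogously)
gen sp_num = .
replace sp_num = 1 if sp_rating == "AAA"
replace sp_num = 2 if sp_rating == "AA+"
replace sp_num = 3 if sp_rating == "AA"
* ... continue down the scale ...
* average across the three agencies and round to the nearest notch
egen avg_rating = rowmean(sp_num moody_num fitch_num)
gen avg_rating_rounded = round(avg_rating)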


Stata commands to test equality of mean and median

Paired or matched data

Test equality of means with the paired t test:

ttest var1 = var2

Test equality of medians with the Wilcoxon matched-pairs signed-rank test:

signrank var1 = var2

or with the sign test of matched pairs:

signtest var1 = var2

Unpaired or unmatched data

Test equality of means with the two-sample t test:

ttest var, by(groupvar)

Test equality of medians with the Wilcoxon rank-sum (Mann-Whitney) test:

ranksum var, by(groupvar)

or with the K-sample equality-of-medians test:

median var, by(groupvar)
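As a quick illustration using Stata's built-in auto data, treating foreign as the grouping variable for unmatched samples:

sysuse auto, clear
ttest price, by(foreign)
ranksum price, by(foreign)
median price, by(foreign)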

Please read this post for how to display the results in a ready-for-use format.

UCLA IDRE has posted an article (link) that may provide a bit more explanation. UCLA IDRE is a great resource for learning statistical analysis. A big thank you to them.


Stata command to display combined Pearson and Spearman correlation matrix

Oftentimes we would like to display Pearson correlations below the diagonal and Spearman correlations above the diagonal. Two built-in commands, pwcorr and spearman, can do the job. However, we have to manually combine Stata output tables when producing the correlation table in the manuscript, which is time-consuming.

I found this fantastic module written by Daniel Klein. His command returns a single table that combines Pearson and Spearman correlations and requires minimal further editing. Thanks, Daniel; please find his work here.

A sample command is as follows:

corsp varlist, pw sig

To install Daniel’s module, type ssc install corsp in Stata’s command window.

A good technical comparison of Pearson and Spearman correlations can be found here.


Stata command to convert string GVKEY to numerical GVKEY or vice versa

The default type of GVKEY in Compustat is string. Sometimes we need it to be numeric in Stata (e.g., when we want to use the super handy command tsset). The command to convert string GVKEY to numerical GVKEY is very simple:

destring gvkey, replace
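Once gvkey is numeric, the panel can be declared, for example (a sketch assuming a fiscal-year variable fyear and one observation per gvkey-fyear pair):

tsset gvkey fyear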

The command to convert numerical GVKEY back to string GVKEY with leading zeros is as follows:

tostring gvkey, replace format(%06.0f)


Stata command to calculate the area under ROC curve

If we want to evaluate the predictive ability of a logit or probit model, Kim and Skinner (2012, JAE, Measuring securities litigation risk) suggest that

A better way of comparing the predictive ability of different models is to use the Receiver Operating Characteristic, or ROC curve (e.g., Hosmer and Lemeshow, 2000, Chapter 5). This curve ‘‘plots the probability of detecting a true signal (sensitivity) and false signal (1—specificity) for the entire range of possible cutpoints’’ (p. 160, our emphasis). The area under the ROC curve (denoted AUC) provides a measure of the model’s ability to discriminate. A value of 0.5 indicates no ability to discriminate (might as well toss a coin) while a value of 1 indicates perfect ability to discriminate, so the effective range of AUC is from 0.5 to 1.0. Hosmer-Lemeshow (2000, p. 162) indicate that AUC of 0.5 indicates no discrimination, AUC of between 0.7 and 0.8 indicates acceptable discrimination, AUC of between 0.8 and 0.9 indicates excellent discrimination, and AUC greater than 0.9 is considered outstanding discrimination.

The Stata commands to report the AUC are as follows. First estimate the model with either logit y x1 x2 or probit y x1 x2, and then type:

lroc, nograph
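If the goal is to compare the AUCs of two competing models, a sketch along the following lines may help (x3 is a hypothetical additional predictor; roccomp reports each area together with a chi-squared test of equality):

logit y x1 x2
predict p1 if e(sample), pr
logit y x1 x2 x3
predict p2 if e(sample), pr
roccomp y p1 p2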

The most recent edition of the book Kim and Skinner refer to is Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.

A technical note from Stata: lroc requires that the current estimation results be from logistic, logit, probit, or ivprobit.

A side question: what is the difference between logistic and logit regression? Nick Cox's short answer is: "same thing with different emphases in reporting" (logistic reports odds ratios, while logit reports coefficients, i.e., log odds). Thanks to a post on Stack Overflow.


Stata commands to calculate skewness

Suppose we are going to calculate the skewness of 12 monthly returns. The 12 returns may be stored in a row (Figure 1) or in a column (Figure 2). This post discusses how to calculate the skewness in these two situations. Please note there are several formulae for skewness out there, which may yield different results. This post uses the formula that yields the same skewness as the Stata command sum var, detail reports.

Figure 1: Returns are stored in a row

Figure 2: Returns are stored in a column

If returns are stored in a row

Stata does not provide a built-in command to calculate the skewness in this situation. The commands sketched below will do the job.
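A minimal sketch, assuming the 12 monthly returns are stored in variables ret1 through ret12 (hypothetical names) and using the same 1/n central-moment convention as sum var, detail:

* row mean and number of non-missing returns
egen double rmean = rowmean(ret1-ret12)
egen double rn = rownonmiss(ret1-ret12)
* second and third central moments, accumulated across the row
gen double m2 = 0
gen double m3 = 0
forvalues i = 1/12 {
    replace m2 = m2 + ((ret`i' - rmean)^2)/rn if !missing(ret`i')
    replace m3 = m3 + ((ret`i' - rmean)^3)/rn if !missing(ret`i')
}
* skewness = m3 / m2^(3/2)
gen double skewness = m3/m2^(3/2)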

If returns are stored in a column

Stata provides a built-in way to calculate skewness in this situation (egen with its skew() function). However, the computation is extremely slow if we have millions of observations. I would suggest calculating the skewness manually, as sketched below:
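A minimal sketch, assuming a long panel with a firm identifier id and a monthly return ret (hypothetical names), again matching the 1/n convention of sum var, detail:

* central moments within each firm
bysort id: egen double rmean = mean(ret)
bysort id: egen double m2 = mean((ret - rmean)^2)
bysort id: egen double m3 = mean((ret - rmean)^3)
* skewness = m3 / m2^(3/2)
gen double skew = m3/m2^(3/2)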



Use Python to download lawsuit data from Stanford Law School’s Securities Class Action Clearinghouse

[Update on 2019-07-07] I am grateful to Shiyu Chen, my research assistant, who did a very good job of not only scraping the top-level table but also extracting additional information from the case summary page (link to the case summary page, case status, update date, case summary, case period start date, and case period end date). I post her Python program below with her permission.

[Original Post] Several papers borrow the litigation risk model in Equation (3) of Kim and Skinner (2012, JAE, Measuring securities litigation risk). The logit model uses total assets, sales growth, stock return, stock return skewness, stock return standard deviation, and turnover to estimate a predicted value of litigation risk. The measure of litigation risk is used by Billings and Cedergren (2015, JAE), Kerr and Ozel (2015, TAR), Bourveau, Lou, and Wang (2018, JAR), and Baginski, Campbell, Hinson, and Koo (2018, TAR), among others (thanks to Chunmei Zhu for the literature review).

The model uses lawsuit data obtained from Stanford Law School's Securities Class Action Clearinghouse. However, the website does not deliver the data in a downloadable format, so I wrote a Python program to extract the data from the website (a technique called web scraping).

The program uses Python 3.x; please install all required modules. I also provide the data (as of 2019-07-07) in a CSV file for easy download (sca.csv).

 


Calculate idiosyncratic stock return volatility

I have noted two slightly different definitions of idiosyncratic stock return volatility in:

  • Campbell, J. Y. and Taksler, G. B. (2003), Equity Volatility and Corporate Bond Yields. The Journal of Finance, 58: 2321–2350. doi:10.1046/j.1540-6261.2003.00607.x
  • Rajgopal, S. and Venkatachalam, M. (2011), Financial reporting quality and idiosyncratic return volatility. Journal of Accounting and Economics, 51: 1–20. doi.org/10.1016/j.jacceco.2010.06.001.

The code in this post is used to calculate Campbell and Taksler’s (2003) idiosyncratic stock return volatility, but it can be easily modified for other definitions.

Specifically, this code requires an input dataset that includes two variables: permno and enddt, where enddt is the date of interest. The code calculates the standard deviation of daily abnormal returns over the 180 calendar days before (and including) enddt. Abnormal returns are calculated using four methods: (1) market-adjusted; (2) standard market model; (3) Fama-French three factors; and (4) Fama-French three factors plus momentum. The code requires at least 21 return observations (about one month of trading days) over that 180-day period for a permno to calculate its stock return volatility.
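The code itself is in SAS, but to give a flavor of the computation, here is a minimal Stata sketch of method (1), the market-adjusted approach, assuming a daily file that already contains permno, a Stata date variable date, the daily return ret, the value-weighted market return vwretd, and the event date enddt merged onto each permno (hypothetical variable names):

* market-adjusted abnormal return
gen double aret = ret - vwretd
* flag the 180 calendar days up to and including enddt, requiring a return
gen byte inwin = (date > enddt - 180) & (date <= enddt) & !missing(aret)
* count return observations in the window and require at least 21
bysort permno: egen nobs = total(inwin)
* idiosyncratic volatility = std. dev. of abnormal returns in the window
bysort permno: egen double ivol = sd(aret) if inwin & nobs >= 21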



Commonly used Stata commands to deal with potential outliers

In accounting archival research, we often take it for granted that we must do something to deal with potential outliers before we run a regression. The commonly used methods are: truncate, winsorize, studentized residuals, and Cook’s distance. I discuss in this post which Stata command to use to implement these four methods.

First of all, why and how we deal with potential outliers is perhaps one of the messiest issues accounting researchers encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, its purpose is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure these methods are appropriate, which we all know is true in accounting research!). Let's start.

Truncate and winsorize

In my opinion, the best Stata commands for truncating and winsorizing are truncateJ and winsorizeJ, written by Judson Caskey. I will not spend time explaining why, but simply highly recommend his work. Please see his website here.

To install these two user-written commands, you can type:

net from https://sites.google.com/site/judsoncaskey/data
net install utilities.pkg

After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.

Studentized residuals

The first step is to run a regression without specifying any vce() option in Stata (i.e., not using robust or clustered standard errors). Suppose the dependent variable is y and the independent variables are x1 and x2. The first step looks like this:

regress y x1 x2

Then, use the predict command:

predict rstu if e(sample), rstudent

If the absolute value of rstu exceeds a certain critical value, the data point is considered an outlier and deleted from the final sample. Stata's manual indicates that "studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model." To be honest, I do not fully understand this explanation, but since rstu is a t statistic, the critical value for a traditional significance level should apply, for example, 1.96 (or 2) for the 5% level. That is why in the literature we often see data points with absolute studentized residuals greater than 2 being deleted. Some papers use a critical value of 3, which corresponds to a 0.27% significance level and seems to me not very reasonable.

Now use the following command to drop "outliers" based on the critical value of 2 (the !missing() condition prevents Stata from also dropping observations with missing rstu, since missing values compare as larger than any number):

drop if abs(rstu) > 2 & !missing(rstu)

The last step is to re-run the regression, but this time we can add appropriate vce() options to address additional issues such as heteroskedasticity:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)

Cook’s distance

This method is similar to studentized residuals. We compute an influence measure, Cook's distance, and then delete any data point whose Cook's distance exceeds 4/N, where N is the number of observations in the estimation sample (Cook's distance is always positive).

regress y x1 x2

predict cooksd if e(sample), cooksd

drop if cooksd > 4/e(N) & !missing(cooksd)   // e(N) is the estimation sample size

Next, re-run the regression with appropriate vce() options:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)

 

Lastly, I thank the authors of the following articles, from which I benefited:

https://www3.nd.edu/~rwilliam/stats2/l24.pdf

https://www.stat-d.si/mz/mz16/coend16.pdf

A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
