Commonly used Stata commands to deal with potential outliers

In accounting archival research, we often take it for granted that we must do something about potential outliers before running a regression. The commonly used methods are truncation, winsorization, studentized residuals, and Cook’s distance. In this post I discuss which Stata commands implement these four methods.

First of all, why and how to deal with potential outliers is perhaps one of the messiest issues that accounting researchers encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, its purpose is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate, and we all know that is often the case in accounting research!). Let’s start.

Truncate and winsorize

In my opinion, the best Stata commands for truncating and winsorizing are truncateJ and winsorizeJ, written by Judson Caskey. I will not take the time to explain why here, but I highly recommend his work. Please see his website for details.

To install these two user-written commands, you can type:

net from https://sites.google.com/site/judsoncaskey/data
net install utilities.pkg

After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.
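If the user-written commands are not available, the two operations can also be sketched with built-in Stata commands only. The block below is a minimal illustration and not Judson Caskey's implementation: the simulated variable x1, the 1st/99th percentile cutoffs, and the names cut_lo, cut_hi, x1_w and x1_t are all assumptions made for the example.

```stata
* Simulated data so the sketch runs on its own (hypothetical variable x1)
clear
set seed 12345
set obs 1000
generate x1 = rnormal()

* Store the 1st and 99th percentiles of x1 in scalars
quietly summarize x1, detail
scalar cut_lo = r(p1)
scalar cut_hi = r(p99)

* Winsorize: pull extreme values in to the cutoffs
generate x1_w = x1
replace x1_w = cut_lo if x1 < cut_lo
replace x1_w = cut_hi if x1 > cut_hi & !missing(x1)

* Truncate: set extreme values to missing so they drop out of later regressions
generate x1_t = x1
replace x1_t = . if (x1 < cut_lo | x1 > cut_hi) & !missing(x1)

summarize x1 x1_w x1_t
```

The !missing(x1) condition matters because Stata treats a missing value as larger than any number, so x1 > cut_hi alone would also be true for missing observations.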

Studentized residuals

The first step is to run a regression without specifying any vce option in Stata (i.e., without robust or clustered standard errors). Suppose the dependent variable is y, and the independent variables are x1 and x2. The first step should look like this:

regress y x1 x2

Then, use the predict command:

predict rstu if e(sample), rstudent

If the absolute value of rstu exceeds a chosen critical value, the data point is considered an outlier and deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since rstu is a t statistic, the critical value for a conventional significance level should apply, for example, 1.96 (or 2) for the 5% significance level. That’s why in the literature we often see data points with absolute studentized residuals greater than 2 being deleted. Some papers use a critical value of 3, which corresponds to a significance level of about 0.27%, and this seems to me not very reasonable.
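The dummy-variable interpretation quoted from the manual can be checked numerically. The following is a small sketch on simulated data (the data-generating process and the variable name d1 are assumptions for illustration): it adds a dummy for observation 1 to the regression and compares the t statistic on that dummy with the studentized residual of the same observation.

```stata
* Simulated data so the sketch runs on its own
clear
set seed 12345
set obs 200
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + 0.5*x1 - 0.3*x2 + rnormal()

* Studentized residuals from the original regression
regress y x1 x2
predict rstu if e(sample), rstudent

* Dummy equal to 1 for observation 1 and 0 elsewhere
generate byte d1 = (_n == 1)
regress y x1 x2 d1

* The two numbers displayed below should agree
display "t statistic on d1          = " _b[d1]/_se[d1]
display "studentized residual, obs 1 = " rstu[1]
```

Repeating this for any other observation gives the same agreement, which is why comparing rstu with t critical values is a natural choice.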

Now use the following command to drop “outliers” based on the critical value of 2:

drop if abs(rstu) > 2

The last step is to re-run the regression, but this time we can add appropriate vce parameters to address additional issues such as heteroskedasticity:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)

Cook’s distance

This method is similar to studentized residuals. We predict an influence statistic, namely Cook’s distance, and then delete any data points with Cook’s distance greater than 4/N, where N is the number of observations in the regression (Cook’s distance is never negative).
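For reference, the standard textbook definition of Cook’s distance for observation i can be written as follows. The notation is my own shorthand and not taken from the sources cited in this post: e_i is the ordinary residual, h_ii is the leverage (the i-th diagonal element of the hat matrix), s^2 is the residual mean square, and k is the number of estimated coefficients including the constant.

```latex
D_i = \frac{e_i^{2}}{k\, s^{2}} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}},
\qquad 0 \le h_{ii} < 1 \;\Rightarrow\; D_i \ge 0 .
```

The formula shows that a data point gets a large Cook’s distance when it combines a large residual with high leverage, whereas the studentized residual looks at the size of the residual only.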

regress y x1 x2

predict cooksd if e(sample), cooksd

drop if cooksd > 4/e(N)

Next, re-run the regression with appropriate vce parameters:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)
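A variation that some may find safer is to flag the potential outliers instead of dropping them, so the raw data stay intact and the two methods can be compared. The sketch below uses simulated data; the flag names out_cook and out_rstu are hypothetical, and the cutoffs 4/N and 2 are simply the ones discussed above.

```stata
* Simulated data so the sketch runs on its own
clear
set seed 12345
set obs 500
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + 0.5*x1 - 0.3*x2 + rnormal()

* First-pass regression without any vce option
regress y x1 x2
predict cooksd if e(sample), cooksd
predict rstu   if e(sample), rstudent

* Flag rather than drop; missing values are excluded explicitly because
* Stata treats a missing value as larger than any number
generate byte out_cook = cooksd > 4/e(N) & !missing(cooksd)
generate byte out_rstu = abs(rstu) > 2   & !missing(rstu)

* How much do the two methods overlap?
tabulate out_cook out_rstu

* Re-run on the retained observations with robust standard errors
regress y x1 x2 if !out_cook, vce(robust)
regress y x1 x2 if !out_rstu, vce(robust)
```

Keeping the flags also makes it easy to report how many observations each rule removes.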

 

Lastly, I thank the authors of the following articles, from which I benefited:

https://www3.nd.edu/~rwilliam/stats2/l24.pdf

https://www.stat-d.si/mz/mz16/coend16.pdf

A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.


5 Responses to Commonly used Stata commands to deal with potential outliers

  1. Lucas says:

    Hi, the winsorize way you mentioned is not working now

  2. Yuchen says:

    Hi, I have an issue regarding winsorizing. Suppose a self-defined variable ABC is constructed from two common variables, say at (total assets) and lt (total liabilities), and we report all three variables in the summary statistics. Do we winsorize twice? That is, do we compute ABC from the winsorized at and lt, or do we first calculate ABC from the raw values and then winsorize it?

    thank you.

    • Kai Chen says:

      I prefer calculating ABC using raw at and lt and then winsorize ABC. But, different researchers may do it differently. Sometimes, when to do winsorization or truncation will affect final results. This is a black box.

  3. Sylvia says:

    Hi,
    Thank you for your post. It is really helpful to me (and has been really helpful to many colleagues of mine).

    I have one question, though. If I first run a simple model in which I only specify e.g. 3 variables, let’s say Price EquityperShare and NetIncomeperShare and afterwards alter the specification by introducing more variables, when do I use the studentized residuals to treat outliers? Do I run the model with most variables, treat outliers and then rerun the models with fewer variables? Or should I do it separately per specification?

    Thank you in advance.
