In accounting archival research, we often take it for granted that we must do something to deal with potential outliers before we run a regression. The commonly used methods are truncation, winsorization, studentized residuals, and Cook’s distance. In this post, I discuss which Stata commands to use to implement these four methods.

First of all, why and how we deal with potential outliers is perhaps one of the messiest issues that accounting researchers will encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, the purpose of this post is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate—we all know that is true in accounting research!). Let’s start.

**Truncate and winsorize**

In my opinion, the best Stata commands for truncating and winsorizing are `truncateJ` and `winsorizeJ`, written by Judson Caskey. I will not take the time to explain why here, but I highly recommend his work. Please see his website here.

To install these two user-written commands, you can type:

`net from https://sites.google.com/site/judsoncaskey/data`

`net install utilities.pkg`

After the installation, you can type `help truncateJ` or `help winsorizeJ` to learn how to use these two commands.
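To make the two ideas concrete, here is a minimal sketch that uses only built-in Stata commands. It is not the code behind `truncateJ` or `winsorizeJ`. The `auto` example dataset, the variable `price`, the scalar names `lo_cut` and `hi_cut`, and the 5th/95th percentile cutoffs are all assumptions chosen for illustration (the dataset is too small for 1st/99th percentile cutoffs to change anything).

```stata
* Sketch only: winsorize and truncate one variable with built-in commands.
* Dataset, variable, scalar names, and cutoffs are illustrative assumptions.
sysuse auto, clear

summarize price, detail          // leaves percentiles in r(p5) and r(p95)
scalar lo_cut = r(p5)
scalar hi_cut = r(p95)

* Winsorize: pull extreme values in to the cutoffs; every observation is kept
generate double price_w = price
replace price_w = lo_cut if price < lo_cut
replace price_w = hi_cut if price > hi_cut & !missing(price)

* Truncate: set extreme values to missing so later commands skip them
generate double price_t = price
replace price_t = . if price < lo_cut | price > hi_cut

summarize price price_w price_t
```

Winsorizing keeps the sample size unchanged, while truncating shrinks it, which is visible in the observation counts reported by the last `summarize`.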

**Studentized residuals**

The first step is to run a regression without specifying any `vce` parameter in Stata (i.e., without robust or clustered standard errors), because `predict` does not offer the `rstudent` option after robust or clustered estimation. Suppose the dependent variable is `y`, and the independent variables are `x1` and `x2`. The first step should look like this:

`regress y x1 x2`

Then, use the `predict` command:

`predict rstu if e(sample), rstudent`

If the absolute value of `rstu` exceeds a certain critical value, the data point is considered an outlier and is deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since `rstu` is a *t* statistic, the critical value for a traditional significance level should be applied, for example, 1.96 (or 2) for the 5% significance level. That’s why in the literature we often see that data points with absolute values of studentized residuals greater than 2 are deleted. Some papers use a critical value of 3, which corresponds to a significance level of about 0.27% and does not seem very reasonable to me.
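The dummy-variable interpretation quoted from the manual can be checked directly. The sketch below is only an illustration: the `auto` example dataset, the regressors, the choice of the first observation, and the variable name `d1` are all assumptions. It computes the studentized residuals, then re-runs the regression with a dummy that equals 1 only for the first observation and compares that dummy's *t* statistic with the first observation's studentized residual.

```stata
* Sketch only: the studentized residual of one observation equals the
* t statistic on a dummy variable that singles out that observation.
sysuse auto, clear

regress price mpg weight
predict double rstu if e(sample), rstudent

* Dummy equal to 1 for the first observation and 0 elsewhere
generate byte d1 = (_n == 1)
regress price mpg weight d1

* The two numbers displayed below should agree
display "t statistic on d1:               " _b[d1]/_se[d1]
display "studentized residual, obs. 1:    " rstu[1]
```

Because the dummy absorbs the first observation, the other coefficients in the second regression are the ones we would get with that observation excluded, which is what the manual means by removing its influence.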

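Now use the following command to drop “outliers” based on the critical value of 2. The `!missing(rstu)` condition matters because Stata treats a missing value as larger than any number, so without it, observations that were not in the regression sample would be dropped too:

`drop if abs(rstu) > 2 & !missing(rstu)`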

The last step is to re-run the regression, but this time we can add appropriate `vce` parameters to address additional issues such as heteroskedasticity:

`regress y x1 x2, vce(robust)`

or

`regress y x1 x2, vce(cl gvkey)`

**Cook’s distance**

This method is similar to studentized residuals. We predict an influence measure, namely Cook’s distance, and then delete any data points with Cook’s distance greater than 4/N, where N is the number of observations in the regression (Cook’s distance is never negative, so no absolute value is needed).

`regress y x1 x2`

`predict cooksd if e(sample), cooksd`

`drop if cooksd > 4/e(N) & !missing(cooksd)`

Next, re-run the regression with appropriate `vce` parameters:

`regress y x1 x2, vce(robust)`

or

`regress y x1 x2, vce(cl gvkey)`
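For reference, the usual textbook definition of Cook’s distance for observation *i* is given below. The notation is my own assumption rather than something taken from the sources cited in this post: $e_i$ is the OLS residual, $s^2$ is the residual mean square, $k$ is the number of estimated coefficients (including the constant), and $h_{ii}$ is the leverage of observation *i*.

$$
D_i = \frac{e_i^{2}}{k\, s^{2}} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}}
$$

Every factor in this expression is non-negative, which is why the cutoff is applied to `cooksd` directly rather than to its absolute value. The measure grows with both the size of the residual and the leverage, so it flags observations that are unusual in either respect.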

Lastly, I thank the authors of the following articles, from which I benefited:

https://www3.nd.edu/~rwilliam/stats2/l24.pdf

https://www.stat-d.si/mz/mz16/coend16.pdf

A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Hi, the winsorize way you mentioned is not working now

I checked and the command worked just fine. Did you have an installation issue or a command running issue?

Hi, I have an issue regarding winsorizing. Suppose a self-defined variable ABC is constructed from two common variables, say `at` (total assets) and `lt` (total liabilities), and we report all three variables in the summary statistics. Do we winsorize twice? How do we do the winsorizing? That is, do we winsorize ABC after winsorizing `at` and `lt`, or do we first calculate ABC and then winsorize it?

thank you.

I prefer calculating ABC using raw `at` and `lt` and then winsorizing ABC. But different researchers may do it differently. Sometimes the point at which winsorization or truncation is done will affect the final results. This is a black box.