Stata commands to run Heckman's two-step estimation

We often see Heckman's two-step procedure in the accounting literature. But how do we implement it in Stata?

The two steps refer to the following two regressions:

Outcome equation: y = X × b1 + u1
Selection equation: Dummy = Z × b2 + u2

The selection equation must contain at least one variable that is not in the outcome equation (an exclusion restriction).

The selection equation must be estimated using probit. An intuitive way to run Heckman's two steps is to estimate the selection equation first and then include the inverse Mills ratio (IMR) derived from the selection equation in the outcome equation. In other words, run two regressions, one after the other.
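For selected observations (Dummy = 1), the IMR equals pdf(Z × b2) / cdf(Z × b2), i.e., the standard normal density divided by the standard normal cumulative distribution, both evaluated at the fitted index Z × b2. This is exactly what the normalden() and normal() lines below compute.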

Stata command for the selection equation, estimated using both observations that are selected into the sample (Dummy = 1) and observations that are not (Dummy = 0):

probit Dummy Z

Note that the vce() option (standard, robust, or clustered standard errors, among others) will not change the resulting IMR.

Next, calculate the IMR immediately (predict uses the most recently estimated model):

* linear prediction Z*b2 from the probit
predict probitxb, xb
* standard normal density and CDF evaluated at the linear prediction
gen pdf = normalden(probitxb)
gen cdf = normal(probitxb)
* inverse Mills ratio for selected observations
gen imr = pdf/cdf

Finally, include imr in the outcome equation, estimated using only the observations selected into the sample (Dummy = 1):

regress y X imr, vce(vcetype)

Note that the first and the second regressions use different numbers of observations.

However, this is not the end of the story. I find that the first probit regression sometimes produces missing IMRs. For example, even if I have 100 observations with the required Dummy and Z data, I may get IMRs for only 60 observations using this step-by-step method. I have not figured out why.

I then note that Stata in fact provides an all-in-one command, heckman, that estimates both the selection equation and the outcome equation:

heckman y X, select(Dummy = Z) twostep first mills(imr) vce(vcetype)

I recommend using the twostep option of the heckman command. This option produces the same results as the step-by-step method, but it may reduce the number of available vce types. In addition, the specified vce() option applies only to the outcome equation and has no effect on the selection equation.

In this all-in-one method, we must pool together the observations that are selected into the sample and the observations that are not, with Dummy equal to 1 or 0 for all observations and y and X missing for the unselected observations. A benefit of the all-in-one method is that the weird missing-IMR issue does not appear.

I did take a closer look at the IMRs that are missing under the step-by-step method: they all take extremely small values in the all-in-one method. Since the step-by-step method offers greater flexibility, if we want to use it but encounter the weird missing-IMR issue, it seems safe to simply set the missing IMRs to zero.
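In Stata, that fix is a one-liner, run after imr has been generated as above:

replace imr = 0 if missing(imr)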

Any comments are welcome.


The calculation of average credit rating using ratings from three rating agencies

I was doing something in finance and wanted to calculate the average rounded credit rating. Basically, I needed to translate textual grades (e.g., AAA, Baa) into numerical values. I found a clue in the following paper:

Becker, B., and T. Milbourn. 2011. How did increased competition affect credit ratings? Journal of Financial Economics 101 (3):493-514.

See their Table 2 for an overview of the rating levels for the three main rating agencies and the numerical value assignments used in their empirical work.
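As a minimal sketch of the translation in Stata, suppose the three agencies' letter ratings are stored in string variables sp_rating, moody_rating, and fitch_rating (hypothetical names), and the numerical scale runs 1 = AAA/Aaa, 2 = AA+/Aa1, and so on (an illustrative scale; use the exact assignments from their Table 2):

* map letter grades to numbers (build moody_num and fitch_num analogously)
gen sp_num = .
replace sp_num = 1 if sp_rating == "AAA"
replace sp_num = 2 if sp_rating == "AA+"
replace sp_num = 3 if sp_rating == "AA"
* ... continue down the scale ...
* average across the three agencies and round to the nearest notch
egen avg_rating = rowmean(sp_num moody_num fitch_num)
gen avg_rating_rounded = round(avg_rating)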


Stata commands to test equality of mean and median

Paired or matched data

Test equality of means with the paired t test:

ttest var1 = var2

Test equality of medians with the Wilcoxon matched-pairs signed-rank test:

signrank var1 = var2

or with the sign test of matched pairs:

signtest var1 = var2

Unpaired or unmatched data

Test equality of means with the two-sample t test:

ttest var, by(groupvar)

Test equality of medians with the Wilcoxon rank-sum (Mann-Whitney) test:

ranksum var, by(groupvar)

or with the K-sample equality-of-medians test:

median var, by(groupvar)
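As a quick illustration using Stata's built-in auto data, treating foreign as the grouping variable for unmatched samples:

sysuse auto, clear
ttest price, by(foreign)
ranksum price, by(foreign)
median price, by(foreign)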

Please read this post for how to display the results in a ready-for-use format.

UCLA IDRE has posted an article (link) that may provide a bit more explanation. UCLA IDRE is a great resource for learning statistical analysis. A big thank you to them.


Stata command to display combined Pearson and Spearman correlation matrix

Oftentimes we would like to display Pearson correlations below the diagonal and Spearman correlations above the diagonal. Two built-in commands, pwcorr and spearman, can do the job. However, we have to manually combine Stata output tables when producing the correlation table in the manuscript, which is time-consuming.

I found this fantastic module written by Daniel Klein. His command returns a single table that combines Pearson and Spearman correlations and requires minimal further editing. Thanks, Daniel; please find his work here.

A sample command is as follows:

corsp varlist, pw sig

To install Daniel’s module, type ssc install corsp in Stata’s command window.

A good technical comparison of Pearson and Spearman correlations can be found here.


Stata command to convert string GVKEY to numerical GVKEY or vice versa

The default type of GVKEY in Compustat is string. Sometimes we need it to be numeric in Stata (e.g., when we want to use the super handy command tsset). The command to convert string GVKEY to numerical GVKEY is very simple:

destring gvkey, replace
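Once gvkey is numeric, the panel can be declared, for example (a sketch assuming a fiscal-year variable fyear and one observation per gvkey-fyear pair):

tsset gvkey fyear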

The command to convert numerical GVKEY back to string GVKEY with leading zeros is as follows:

tostring gvkey, replace format(%06.0f)


Stata command to calculate the area under ROC curve

If we want to evaluate the predictive ability of a logit or probit model, Kim and Skinner (2012, JAE, Measuring securities litigation risk) suggest that

A better way of comparing the predictive ability of different models is to use the Receiver Operating Characteristic, or ROC curve (e.g., Hosmer and Lemeshow, 2000, Chapter 5). This curve ‘‘plots the probability of detecting a true signal (sensitivity) and false signal (1—specificity) for the entire range of possible cutpoints’’ (p. 160, our emphasis). The area under the ROC curve (denoted AUC) provides a measure of the model’s ability to discriminate. A value of 0.5 indicates no ability to discriminate (might as well toss a coin) while a value of 1 indicates perfect ability to discriminate, so the effective range of AUC is from 0.5 to 1.0. Hosmer-Lemeshow (2000, p. 162) indicate that AUC of 0.5 indicates no discrimination, AUC of between 0.7 and 0.8 indicates acceptable discrimination, AUC of between 0.8 and 0.9 indicates excellent discrimination, and AUC greater than 0.9 is considered outstanding discrimination.

The Stata commands to report the AUC are as follows. First estimate the model with either logit y x1 x2 or probit y x1 x2, and then type:

lroc, nograph
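If the goal is to compare the AUCs of two competing models, a sketch along the following lines may help (x3 is a hypothetical additional predictor; roccomp reports each area together with a chi-squared test of equality):

logit y x1 x2
predict p1 if e(sample), pr
logit y x1 x2 x3
predict p2 if e(sample), pr
roccomp y p1 p2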

The most recent edition of the book Kim and Skinner refer to is Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.

A technical note from Stata: lroc requires that the current estimation results be from logistic, logit, probit, or ivprobit.

A side question: what is the difference between logistic and logit regression? Nick Cox's short answer is: "same thing with different emphases in reporting" (logistic reports odds ratios, while logit reports coefficients, i.e., log odds). Thanks to a post on Stack Overflow.


Stata commands to calculate skewness

Suppose we are going to calculate the skewness of 12 monthly returns. The 12 returns may be stored in a row (Figure 1) or in a column (Figure 2). This post discusses how to calculate the skewness in these two situations. Please note there are several formulae for skewness out there, which may yield different results. This post uses the formula that yields the same skewness as the Stata command sum var, detail reports.

Figure 1: Returns are stored in a row

Figure 2: Returns are stored in a column

If returns are stored in a row

Stata does not provide a built-in command to calculate the skewness in this situation. The commands sketched below will do the job.
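A minimal sketch, assuming the 12 monthly returns are stored in variables ret1 through ret12 (hypothetical names) and using the same 1/n central-moment convention as sum var, detail:

* row mean and number of non-missing returns
egen double rmean = rowmean(ret1-ret12)
egen double rn = rownonmiss(ret1-ret12)
* second and third central moments, accumulated across the row
gen double m2 = 0
gen double m3 = 0
forvalues i = 1/12 {
    replace m2 = m2 + ((ret`i' - rmean)^2)/rn if !missing(ret`i')
    replace m3 = m3 + ((ret`i' - rmean)^3)/rn if !missing(ret`i')
}
* skewness = m3 / m2^(3/2)
gen double skewness = m3/m2^(3/2)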

If returns are stored in a column

Stata provides a built-in way to calculate skewness in this situation (egen with its skew() function). However, the computation is extremely slow if we have millions of observations. I would suggest calculating the skewness manually, as sketched below:
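A minimal sketch, assuming a long panel with a firm identifier id and a monthly return ret (hypothetical names), again matching the 1/n convention of sum var, detail:

* central moments within each firm
bysort id: egen double rmean = mean(ret)
bysort id: egen double m2 = mean((ret - rmean)^2)
bysort id: egen double m3 = mean((ret - rmean)^3)
* skewness = m3 / m2^(3/2)
gen double skew = m3/m2^(3/2)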



Use Python to download lawsuit data from Stanford Law School’s Securities Class Action Clearinghouse

[Update on 2019-07-07] I am grateful to Shiyu Chen, my research assistant, who did a very good job of not only scraping the top-level table but also extracting additional information from the case summary page (link to the case summary page, case status, update date, case summary, case period start date, and case period end date). I post her Python program below with her permission.

[Original Post] Several papers borrow the litigation risk model in Equation (3) of Kim and Skinner (2012, JAE, Measuring securities litigation risk). The logit model uses total assets, sales growth, stock return, stock return skewness, stock return standard deviation, and turnover to estimate a predicted value of litigation risk. The measure of litigation risk is used by Billings and Cedergren (2015, JAE), Kerr and Ozel (2015, TAR), Bourveau, Lou, and Wang (2018, JAR), and Baginski, Campbell, Hinson, and Koo (2018, TAR), among others (thanks to Chunmei Zhu for the literature review).

The model uses lawsuit data obtained from Stanford Law School's Securities Class Action Clearinghouse. However, the website does not deliver the data in a downloadable format, so I wrote a Python program to extract the data from the website (a technique called web scraping).

The program uses Python 3.x; please install all required modules. I also provide the data (as of 2019-07-07) in a CSV file for easy download (sca.csv).

 


Calculate idiosyncratic stock return volatility

I have noted two slightly different definitions of idiosyncratic stock return volatility in:

  • Campbell, J. Y. and Taksler, G. B. (2003), Equity Volatility and Corporate Bond Yields. The Journal of Finance, 58: 2321–2350. doi:10.1046/j.1540-6261.2003.00607.x
  • Rajgopal, S. and Venkatachalam, M. (2011), Financial reporting quality and idiosyncratic return volatility. Journal of Accounting and Economics, 51: 1–20. doi.org/10.1016/j.jacceco.2010.06.001.

The code in this post is used to calculate Campbell and Taksler’s (2003) idiosyncratic stock return volatility, but it can be easily modified for other definitions.

Specifically, this code requires an input dataset that includes two variables: permno and enddt, where enddt is the date of interest. The code calculates the standard deviation of daily abnormal returns over the 180 calendar days before (and including) enddt. Abnormal returns are calculated using four methods: (1) market-adjusted; (2) standard market model; (3) Fama-French three factors; and (4) Fama-French three factors plus momentum. The code requires at least 21 return observations (about one month of trading days) over that 180-day period for a permno to calculate its stock return volatility.
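The code itself is in SAS, but to give a flavor of the computation, here is a minimal Stata sketch of method (1), the market-adjusted approach, assuming a daily file that already contains permno, a Stata date variable date, the daily return ret, the value-weighted market return vwretd, and the event date enddt merged onto each permno (hypothetical variable names):

* market-adjusted abnormal return
gen double aret = ret - vwretd
* flag the 180 calendar days up to and including enddt, requiring a return
gen byte inwin = (date > enddt - 180) & (date <= enddt) & !missing(aret)
* count return observations in the window and require at least 21
bysort permno: egen nobs = total(inwin)
* idiosyncratic volatility = std. dev. of abnormal returns in the window
bysort permno: egen double ivol = sd(aret) if inwin & nobs >= 21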



Commonly used Stata commands to deal with potential outliers

In accounting archival research, we often take it for granted that we must do something to deal with potential outliers before we run a regression. The commonly used methods are: truncate, winsorize, studentized residuals, and Cook’s distance. I discuss in this post which Stata command to use to implement these four methods.

First of all, why and how we deal with potential outliers is perhaps one of the messiest issues accounting researchers encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, its purpose is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure these methods are appropriate, which we all know is true in accounting research!). Let's start.

Truncate and winsorize

In my opinion, the best Stata commands for truncating and winsorizing are truncateJ and winsorizeJ, written by Judson Caskey. I will not spend time explaining why, but simply highly recommend his work. Please see his website here.

To install these two user-written commands, you can type:

net from https://sites.google.com/site/judsoncaskey/data
net install utilities.pkg

After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.

Studentized residuals

The first step is to run a regression without specifying any vce() option in Stata (i.e., not using robust or clustered standard errors). Suppose the dependent variable is y and the independent variables are x1 and x2. The first step looks like this:

regress y x1 x2

Then, use the predict command:

predict rstu if e(sample), rstudent

If the absolute value of rstu exceeds a certain critical value, the data point is considered an outlier and deleted from the final sample. Stata's manual indicates that "studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model." To be honest, I do not fully understand this explanation, but since rstu is a t statistic, the critical value for a traditional significance level should apply, for example, 1.96 (or 2) for the 5% level. That is why in the literature we often see data points with absolute studentized residuals greater than 2 being deleted. Some papers use a critical value of 3, which corresponds to a 0.27% significance level and seems to me not very reasonable.

Now use the following command to drop "outliers" based on the critical value of 2 (the !missing() condition prevents Stata from also dropping observations with missing rstu, since missing values compare as larger than any number):

drop if abs(rstu) > 2 & !missing(rstu)

The last step is to re-run the regression, but this time we can add appropriate vce() options to address additional issues such as heteroskedasticity:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)

Cook’s distance

This method is similar to studentized residuals. We compute an influence measure, Cook's distance, and then delete any data point whose Cook's distance exceeds 4/N, where N is the number of observations in the estimation sample (Cook's distance is always positive).

regress y x1 x2

predict cooksd if e(sample), cooksd

drop if cooksd > 4/e(N) & !missing(cooksd)   // e(N) is the estimation sample size

Next, re-run the regression with appropriate vce() options:

regress y x1 x2, vce(robust)

or

regress y x1 x2, vce(cl gvkey)

 

Lastly, I thank the authors of the following articles, from which I benefited:

https://www3.nd.edu/~rwilliam/stats2/l24.pdf

https://www.stat-d.si/mz/mz16/coend16.pdf

A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
