`and`

and `or`

in `if`

statement, compared to SAS. For example:
In SAS, we can write `if 2001 <= fyear <= 2010`

. But in Stata, we usually write: `if fyear >= 2001 & fyear <= 2010`

.

In fact, Stata provides a handy `inrange`

function. The above `if`

statement can be written as: `if inrange(fyear, 2001, 2010)`

.

Similarly, Stata provides another `inlist`

function. The syntax is `inlist(z, a, b, ...)`

, which returns 1 if z = a or z = b … In `if`

statement, it is equivalent to `if z = a | z = b | ...`

First of all, please refer to this post to see Stata commands to test equality of mean and median.

However, it is time-consuming to glean numbers from the output of these Stata commands and place them in a table. It is even more struggling that you have to repeat the tedious process every time you update your sample.

I write Stata codes to streamline the process. The codes vary between unpaired (i.e., unmatched) data and paired data.

**Unpaired data**

The above example is unpaired data, i.e., suspect firm-years and other firm-years are not 1-to-1 or 1-to-m matched. One usage of unpaired data is the first step of Heckman’s two-step procedure, in which two groups of observations (i.e., the group that will be selected into the second step and the group that will not be selected into the second step) **are stacked vertically** in the dataset. The following codes are used for unpaired data. You only need to modify the first two lines to suit your data. The codes will generate a table in Stata’s output window like this:

You can then select the output and right-click “Copy as table” and paste in Excel for a quick edit. The codes use t-test for mean and Wilcoxon rank-sum test for median.

local vars retsd cfosd lnilliq authpct //put your variables here local group grpid //"grpid" is your group indicator that takes 1 and 0 foreach v in `vars' { di "`v'" ttest `v', by(`group') local mean_`v'_mean_0=round(r(mu_1),.001) local mean_`v'_mean_1=round(r(mu_2),.001) local mean_`v'_diff=`mean_`v'_mean_1'-`mean_`v'_mean_0' local mean_`v'_p=r(p) } foreach v in `vars' { sum `v' if `group'==0, detail local p50_`v'_p50_0=round(r(p50),.001) sum `v' if `group'==1, detail local p50_`v'_p50_1=round(r(p50),.001) ranksum `v', by(`group') local p50_`v'_n_0=r(N_1) local p50_`v'_n_1=r(N_2) local p50_`v'_diff=`p50_`v'_p50_1'-`p50_`v'_p50_0' local p50_`v'_p=2*normprob(-abs(r(z))) } qui { noi di _newline noi di "{hline 115}" noi di _col(15) "{c |} `group' = 1" /// _col(45) "{c |} `group' = 0" /// _col(75) "{c |} Diff" noi di _col(16) "{hline 100}" noi di _col(15) "{c |} Mean" /// _col(25) "{c |} Median" /// _col(35) "{c |} N" /// _col(45) "{c |} Mean" /// _col(55) "{c |} Median" /// _col(65) "{c |} N" /// _col(75) "{c |} Mean" /// _col(85) "{c |} P" /// _col(95) "{c |} Median" /// _col(105) "{c |} P" noi di "{hline 115}" foreach v in `vars' { noi di %12s abbrev("`v'",12) /// _col(15) "{c |}" %8.3f `mean_`v'_mean_1' /// _col(25) "{c |}" %8.3f `p50_`v'_p50_1' /// _col(35) "{c |}" %8.0f `p50_`v'_n_1' /// _col(45) "{c |}" %8.3f `mean_`v'_mean_0' /// _col(55) "{c |}" %8.3f `p50_`v'_p50_0' /// _col(65) "{c |}" %8.0f `p50_`v'_n_0' /// _col(75) "{c |}" %8.3f `mean_`v'_diff' /// _col(85) "{c |}" %8.3f `mean_`v'_p' /// _col(95) "{c |}" %8.3f `p50_`v'_diff' /// _col(105) "{c |}" %8.3f `p50_`v'_p' } noi di "{hline 115}" }

**Paired data**

A typical usage of paired data is to identify a matched control group for the treatment group. For example, identify a matched firm-year for an event firm-year based on a set of characteristics (same industry, same year, similar size and book-to-market), or identify a matched firm for every event firm based on the closest propensity score (i.e., propensity score matching).

The following table is an example that compares the mean and median of two matched groups—restating firms and non-restating group. Each restating firm is matched with a non-restating firm.

Because of this matching relationship, every event firm and its control firm will be placed in the same row in the dataset. In other words, event firms and control firms are **aligned horizontally**. The following codes are used for paired data. You only need to modify the first two lines to suit your data. You must specify the same number of variables in the matched order in the first two lines. In other words, the first variable in the first line must be paired with the first variable in the second line, and so on. The codes will generate a table in Stata’s output window like this:

The codes use paired t-test for mean and Wilcoxon rank-sign test for median.

//put your paired variables in the first two lines. 1-to-1 correspondece is must local agrp "drpre4 drpre3 drpre2 drpre1" //e.g., treatment group local bgrp "mdrpre4 mdrpre3 mdrpre2 mdrpre1" //e.g., control group local n : word count `agrp' forvalues i = 1/`n' { local a : word `i' of `agrp' local b : word `i' of `bgrp' ttest `a'=`b' local mean_`a'=round(r(mu_1),.001) local mean_`b'=round(r(mu_2),.001) local mean_`a'_diff=`mean_`a''-`mean_`b'' local n_`a'=r(N_1) local mean_p_`a'=r(p) sum `a', detail local p50_`a'=round(r(p50),.001) sum `b', detail local p50_`b'=round(r(p50),.001) signrank `a'=`b' local p50_`a'_diff=round(`p50_`a''-`p50_`b'',.001) local p50_p_`a'=2*normprob(-abs(r(z))) } qui { noi di _newline noi di "{hline 120}" noi di _col(30) "{c |}" /// _col(40) "{c |} Var1" /// _col(60) "{c |} Var2" /// _col(80) "{c |} Diff" noi di _col(41) "{hline 80}" noi di %27s "Paired Var1 and Var2" /// _col(30) "{c |} N" /// _col(40) "{c |} Mean" /// _col(50) "{c |} Median" /// _col(60) "{c |} Mean" /// _col(70) "{c |} Median" /// _col(80) "{c |} Mean" /// _col(90) "{c |} P" /// _col(100) "{c |} Median" /// _col(110) "{c |} P" noi di "{hline 120} forvalues i = 1/`n' { local a : word `i' of `agrp' local b : word `i' of `bgrp' noi di %27s abbrev("`a' vs `b'",27) /// _col(30) "{c |}" %8.0f `n_`a'' /// _col(40) "{c |}" %8.3f `mean_`a'' /// _col(50) "{c |}" %8.3f `p50_`a'' /// _col(60) "{c |}" %8.3f `mean_`b'' /// _col(70) "{c |}" %8.3f `p50_`b'' /// _col(80) "{c |}" %8.3f `mean_`a'_diff' /// _col(90) "{c |}" %8.3f `mean_p_`a'' /// _col(100) "{c |}" %8.3f `p50_`a'_diff' /// _col(110) "{c |}" %8.3f `p50_p_`a'' } noi di "{hline 120} }

]]>

The two steps refer to the following two regressions:

Outcome equation: `y = X × b1 + u1`

Selection equation: `Dummy = Z × b2 + u2`

The selection equation must contain at least one variable that is not in the outcome equation.

The selection equation must be estimated using Probit. An intuitive way to do Heckman’s two steps is to estimate the selection equation first. Then include inverse mills ratio (IMR) derived from the selection equation in the outcome equation. In other words, run two regressions, one after the other.

Stata command for the selection equation:

`probit Dummy X`

(using both observations that are selected into the sample and observations that are not selected into the sample, i.e., `Dummy`

= 1 or `Dummy`

= 0)

Note `vce`

option (i.e., standard, robust or clustered standard errors, among others) will not change the resultant IMR.

Next, calculate IMR immediately:

`predict probitxb, xb`

`ge pdf = normalden(probitxb)`

`ge cdf = normal(probitxb)`

`ge imr = pdf/cdf`

Finally, include `imr`

in the outcome equation:

`reg y X imr, vce(specified_vcetype)`

(using observations that are selected into the sample only)

Note the first and the second regression use different numbers of observations.

However, this is not over. I find the first Probit regression sometimes causes missing IMR. For example, even if I have 100 observations with required `Dummy`

and `X`

data, I may only get IMR for 60 observations using this step-by-step method. I have not figured out why.

I then note that Stata in fact provides an all-in-one method to estimate both the selection equation and the outcome equation in one command `heckman`

:

`heckman y X, select(Dummy = Z), twostep first mills(imr) vce(specfied_vcetype)`

I recommend using `twostep`

option of the `heckman`

command. This option will produce the same results with the step-by-step method. But this option may reduce the number of available `vce`

types. In addition, the specified `vce`

option only applies to the outcome equation and has no effect on the selection equation.

In this all-in-one method, we must pool together both observations that are selected into the sample and observations that are not selected into the sample, in which `Dummy`

is 1 or 0 for all observations and `y`

and `X`

are missing for observations that are not selected into the sample. A benefit of this all-in-one method is that the weird missing-IMR issue will not appear.

I do have a closer look at missing IMR from the step-by-step method. They all have an extremely small value in the all-in-one method. I find that the step-by-step method has greater flexibility. Thus, if we want to use the step-by-step method but encounter the weird missing-IMR issue, it seems safe to just set missing IMR as zero.

Any comment is welcome.

]]>Becker, B., and T. Milbourn. 2011. How did increased competition affect credit ratings? *Journal of Financial Economics* 101 (3):493-514.

See their Table 2 for an overview of the ratings levels for the three main rating agencies and the numerical value assignments used in their empirical work.

]]>**ttest var1 = var2**

Wilcoxon matched-pairs signed-rank test:**signrank var1 = var2**

Sign test of matched pairs:

**signtest var1 = var2**

Unpaired or unmatchedTwo-sample t test:**ttest var, by(groupvar)**

Wilcoxon rank-sum test or Mann_Whitney test:**ranksum var, by(groupvar)**

K-sample equality-of-medians test:

**median var, by(groupvar)**

Please read this post for how to display the results in a ready-for-use format.

UCLA IDRE has posted an article (link) that may provide a bit more explanation. UCLA IDRE is a great resource for learning statistical analysis. A big thank you to them.

]]>`pwcorr`

and `spearman`

, can do the job. However, we have to manually combine Stata output tables when producing the correlation table in the manuscript, which is time-consuming.
I find this fantastic module written by Daniel Klein. His command will return one table that combines Pearson and Spearman correlations and needs the fewest further edits. Thanks Daniel and please find his work here.

A sample command is as follows:

`corsp varlist, pw sig`

To install Daniel’s module, type `ssc install corsp`

in Stata’s command window.

A good technical comparison of Pearson and Spearman correlations can be found here.

]]>`tsset`

). The command to convert string GVKEY to numerical GVEKY is very simple:
`destring gvkey, replace`

The command to revert numerical GVKEY to string GVKEY with leading zeros is as follows:

`tostring gvkey, replace format(%06.0f)`

A better way of comparing the predictive ability of different models is to use the Receiver Operating Characteristic, or ROC curve (e.g., Hosmer and Lemeshow, 2000, Chapter 5). This curve ‘‘plots the probability of detecting a true signal (sensitivity) and false signal (1—speciﬁcity) for the entire range of possible cutpoints’’ (p. 160, our emphasis). The area under the ROC curve (denoted AUC) provides a measure of the model’s ability to discriminate. A value of 0.5 indicates no ability to discriminate (might as well toss a coin) while a value of 1 indicates perfect ability to discriminate, so the effective range of AUC is from 0.5 to 1.0. Hosmer-Lemeshow (2000, p. 162) indicate that AUC of 0.5 indicates no discrimination, AUC of between 0.7 and 0.8 indicates acceptable discrimination, AUC of between 0.8 and 0.9 indicates excellent discrimination, and AUC greater than 0.9 is considered outstanding discrimination.

The Stata command to report AUC is as follows:

`logit y x1 x2`

or `probit y x1 x2`

`lroc, nograph`

The most recent edition of the book Kim and Skinner refer to is Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.

A technical note from Stata: `lroc`

requires that the current estimation results be from `logistic`

, `logit`

, `probit`

, or `ivprobit`

.

A side question: what’s the difference between logistic and logit regression? Nick Cox’s short answer is: “same thing with different emphases in reporting.” (something like one gives you the odds ratios, the other gives you the log of the odds ratios.)—thanks to a post on Stack Overflow.

]]>`sum var, detail`

reports.
* Figure 1:* Returns are stored in a row

* Figure 2:* Returns are stored in a column

**If returns are stored in a row**

Stata does not provide a command to calculate the skewness in this situation. The following Stata commands will do the job.

egen ret_mean=rowmean(ret1-ret12) egen n=rownonmiss(ret1-ret12) foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 { ge `v'_m3=(`v'-ret_mean)^3 } egen m3=rowtotal(ret1_m3-ret12_m3), missing replace m3=m3/n foreach v in ret1 ret2 ret3 ret4 ret5 ret6 ret7 ret8 ret9 ret10 ret11 ret12 { ge `v'_m2=(`v'-ret_mean)^2 } egen m2=rowtotal(ret1_m2-ret12_m2), missing replace m2=m2/n ge ret_skew=m3*m2^(-3/2)

**If**** returns are stored in a column**

Stata provides a command to calculate skewness in this situation (`egen`

and `skewness`

). However, the computation is extremely slow if we have millions of observations. I would suggest calculating the skewness manually as follows:

sort permno (add more variables here to identify a group) by permno: egen ret_mean=mean(ret) by permno: egen n=count(ret) ge ret_m3=(ret-ret_mean)^3 by permno: egen m3=total(ret_m3) replace m3=m3/n ge ret_m2=(ret-ret_mean)^2 by permno: egen m2=total(ret_m2) replace m2=m2/n ge skewness=m3*m2^(-3/2)

]]>

The model uses lawsuit data obtained from Stanford Law School’s Securities Class Action Clearinghouse (http://securities.stanford.edu). However, the website does not deliver the data in a downloadable format. I write the Python program for extracting the data from the website (a technique called webscraping).

I use Python 3.x and please install all required modules. I provide the data (as of 2018/07/15) in a CSV file for easy download (scac.csv).

from urllib import request from bs4 import BeautifulSoup import re from math import ceil import csv # Determine the number of pages to webscrape scac = "http://securities.stanford.edu/filings.html" page = request.urlopen(scac) soup = BeautifulSoup(page, 'html.parser') heading = soup.find_all('h4')[-1].get_text() total_record_num = re.findall(r'\d+', heading)[0] total_page_num = ceil(int(total_record_num) / 20) # Webscrape all pages container = [("filing_name", "filing_date", "district_court", "exchange", "ticker")] i = 1 while i <= total_page_num: url = scac + "?page=" + repr(i) print(url) page = request.urlopen(url) soup = BeautifulSoup(page, 'html.parser') table = soup.find('table', class_ = 'table table-bordered table-striped table-hover') tbody = table.find('tbody') for row in tbody.find_all('tr'): columns = row.find_all('td') c1 = re.sub(r'[\t\n]', '', columns[0].get_text()).strip() c2 = re.sub(r'[\t\n]', '', columns[1].get_text()).strip() c3 = re.sub(r'[\t\n]', '', columns[2].get_text()).strip() c4 = re.sub(r'[\t\n]', '', columns[3].get_text()).strip() c5 = re.sub(r'[\t\n]', '', columns[4].get_text()).strip() container.append((c1, c2, c3, c4, c5)) i = i + 1 # Write to a CSV file with open('scac.csv', 'w', newline='') as csvfile: writer = csv.writer(csvfile) writer.writerows(container)

]]>