Use Stata to do propensity score matching (PSM)

Most propensity score matching (PSM) examples use cross-sectional data rather than panel data. However, accounting research often uses panel data (i.e., observations with two subscripts i and t, e.g., firm-years) in a difference-in-differences (DID) research design, so that there are two dummy variables, TREATMENT and POST, in a regression of the following standard form:

Y = b0 + b1*TREATMENT + b2*POST + b3*TREATMENT*POST + e

where TREATMENT often indicates exposure to an event and POST indicates whether the observation falls before or after that event. One-to-one matching is common, and it arguably makes more sense to base such matching on selected pre-event, firm-level variables (Xs). The pre-event variables can be measured either at the most recent date before the event (e.g., total assets at the most recent quarter end before the event) or as an average over the pre-event period (e.g., average total assets over the four quarters preceding the event).

We need to do a probit or logit regression for PSM:

TREATMENT = X1 + X2 + …

The single nearest neighbour in terms of propensity score will be selected as the matched control, and then DID regressions can be done subsequently.

psmatch2 is a user-written module that finds matched controls using PSM. First, we need to install the module in Stata by typing ssc install psmatch2.

Then the following command should work in most cases:
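
A minimal sketch of such a command, run on simulated data so that it is self-contained. The variable names treatment, x1 and x2 and the simulated data are placeholders of mine, not part of any real dataset:

```stata
* Sketch only: simulated firm-level data with hypothetical variable names.
* Requires psmatch2 (ssc install psmatch2).
clear
set seed 12345
set obs 1000
generate x1 = rnormal()
generate x2 = rnormal()
generate treatment = runiform() < invlogit(-1.5 + 0.5*x1 + 0.5*x2)

* One-to-one nearest-neighbour matching on the estimated propensity score
psmatch2 treatment x1 x2, noreplacement logit descending
```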

There are three options in the above command:

  • noreplacement – perform one-to-one matching without replacement. I would add this option to find more unique matched controls.
  • logit – use logit instead of the default probit to estimate the propensity score. I am indifferent to this option.
  • descending – more details about this option can be found in Lunt (2014). The author concludes that “in the absence of a caliper (another option I would omit to maximize matched controls), the descending method provides the best matches, particularly when there is a large separation between exposed (treated) and unexposed (untreated) subjects.” So, I would add this option.

psmatch2 creates a number of variables, of which the following two are the most useful for subsequent DID regressions:

  • _id – In the case of one-to-one and nearest-neighbors matching, a new identifier created for all observations.
  • _n1 – In the case of one-to-one and nearest-neighbors matching, for every treatment observation, it stores the new identifier (_id) of the matched control observation.
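
One way these two variables might be used is sketched below: each treated firm and its matched control get a common pair identifier, and only complete pairs are kept. This assumes one-to-one matching without replacement on a firm-level cross-section; firm_id, pair and pair_size, as well as the simulated data, are hypothetical names introduced for illustration.

```stata
* Sketch only: simulated firm-level data; all variable names are hypothetical.
* Requires psmatch2 (ssc install psmatch2).
clear
set seed 12345
set obs 1000
generate firm_id = _n
generate x1 = rnormal()
generate x2 = rnormal()
generate treatment = runiform() < invlogit(-1.5 + 0.5*x1 + 0.5*x2)

psmatch2 treatment x1 x2, noreplacement logit descending

* A control carries its own _id; a treated firm carries its control's _id in _n1
generate pair = _id if _treated == 0
replace pair = _n1 if _treated == 1

* Keep only complete pairs (one treated firm plus its matched control)
bysort pair: egen pair_size = count(pair)
keep if pair_size == 2
keep firm_id treatment pair
```

The retained list of firm_id values can then be merged back onto the firm-year panel before running the DID regression.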

There is a limitation with psmatch2. Sometimes we may want the treatment and its matched control to have the same value on a variable X. For example, we may want the treatment and its matched control to be drawn from the same industry, or to be of the same gender. psmatch2 seems incapable of this. Some imperfect solutions are discussed in this post (i.e., adding i.INDUSTRY or i.GENDER to the Xs). In contrast, the PSMATCH procedure in SAS seems to have a perfect solution in the EXACT= option of its MATCH statement, although I don’t know whether SAS implements it as a stratification method; if so, psmatch2 can do the same by tweaking its options. More details about the SAS procedure can be found in this manual.

Another conclusion is that psmatch2 is preferable to Stata’s built-in command teffects, because we need the variables generated by psmatch2 (e.g., _id and _n1) for subsequent DID regressions, while teffects does not return such variables by default.

This article aims to provide a quick how-to and thus ignores some necessary steps for PSM, such as assessing covariate balance. A more rigorous discussion of PSM in accounting research can be found in Shipman, Swanquist, and Whited (2017).

I benefited from the following articles; thanks to both authors:


Export a SAS dataset to Stata with all variable names converted to lowercase

I use both SAS and Stata and often need to transfer data between the two. SAS is not case-sensitive with respect to variable names, whereas Stata is. I always prefer working with lowercase variable names in Stata. The following code exports a SAS dataset to Stata with all variable names converted to lowercase.

The macro I use is borrowed from Adrian’s work. Thanks Adrian.

A related post can be found here:



Clean up TRACE Enhanced dataset

WRDS provides an excellent manual and SAS code for cleaning up the raw TRACE Enhanced bond transaction data, primarily based on Dick-Nielsen, Jens, “How to Clean Enhanced TRACE Data” (December 3, 2014), available at SSRN. Dick-Nielsen also provides his SAS code for the clean-up. Several papers refer to his cleaning steps.

Both the WRDS code and Dick-Nielsen’s code remove cancellations, corrections, reversals, and double counting of agency trades. Dick-Nielsen’s code provides a few more options, e.g., removing commissioned trades.


Stata command to perform Chow test

A Chow test is simply a test of whether the coefficients estimated over one group of the data are equal to the coefficients estimated over another.

I found two useful articles on Stata’s official website:

Can you explain Chow tests?
How can I compute the Chow test statistic?

Suppose we run the following regressions separately in two groups:

regress y x1 x2 if group==1
regress y x1 x2 if group==2

Then the following commands will test the equality of the coefficients on x1 and x2 across the two groups:

generate g2=(group==2)
regress y c.x1##i.g2 c.x2##i.g2
contrast g2 g2#c.x1 g2#c.x2, overall
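
To try the same pattern on real numbers, here is a sketch using the auto dataset that ships with Stata. The choice of price, weight, mpg and foreign is mine, purely for illustration; foreign already is a 0/1 group indicator, so no extra dummy is needed:

```stata
* Sketch only: Chow-type test of whether the price equation differs
* between domestic and foreign cars in Stata's bundled auto data
sysuse auto, clear

* Fully interacted model: every coefficient may differ by group
regress price c.weight##i.foreign c.mpg##i.foreign

* Test the group intercept shift and both slope shifts,
* individually and jointly (the "Overall" row)
contrast foreign foreign#c.weight foreign#c.mpg, overall
```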

Stata’s official website gives an example of the output:

In this example, to test the equality of coefficients on x1 and x2, 6.06 and 2.80 are the F-stats that we are looking for.


SAS macro to count the number of analysts following a firm

This macro counts the number of analysts who follow a specific firm. Although this is a commonly used measure in the literature, prior studies often give only a vague description of what they do. The question is: what does “analysts following a firm” really mean?

First, it is only meaningful to count the number at a specified date.

Second, how do we define “an analyst is actually following a firm”? I use the following definition: if an analyst issued any forecast (EPS, stock price, sales, or anything else) within a certain window (e.g., 180 days) before the specified date, then the analyst is counted. This definition ensures that the analyst is “actively” following the firm.

That is why my macro requires two arguments: DATE and WINDOW. The macro answers the following question: at a specified date, how many analysts are actively following Firm A, B, …?
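
The counting logic can be sketched in Stata as follows. This is not the SAS macro itself; the tiny dataset, the variable names (ticker, analyst, anndate) and the choice to exclude forecasts issued on the specified date itself are all assumptions made for illustration.

```stata
* Sketch only: hand-typed forecast records with hypothetical variable names
clear
input str3 ticker analyst str10 anndate
"AAA" 1 "2015-03-01"
"AAA" 1 "2015-05-15"
"AAA" 2 "2014-11-20"
"AAA" 3 "2015-06-10"
"BBB" 4 "2015-02-02"
"BBB" 4 "2015-07-15"
end
generate fdate = date(anndate, "YMD")
format fdate %td

* The two arguments: the specified date and the look-back window in days
scalar asof = td(30jun2015)
scalar win_days = 180

* Keep forecasts issued within the window before the specified date
keep if fdate >= asof - win_days & fdate < asof

* Count each analyst once per firm, then count analysts per firm
bysort ticker analyst: keep if _n == 1
collapse (count) n_analysts = analyst, by(ticker)
list
```

Firms with no forecast inside the window simply do not appear in the result, so they would need to be filled in with zero if the full firm list matters.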



The art of regular expression

Regular expressions are a powerful tool for text search. They are the foundation of a lot of textual analysis research, though today’s textual analysis in computer science has gone far beyond text search. Regular expression operations are programming language independent. Any modern programming language supports regular expression operations well. So, if someone tells you that PERL is the best language to do text search (or textual analysis), that is plainly wrong.

Writing regular expressions is a work of art! You can find the building blocks of regular expressions here. I created this post to gather examples of regular expressions that solve certain text search questions. I will grow this post continuously.
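
As one small example of the kind this post collects, here is a sketch in Stata that pulls the first four-digit year (19xx or 20xx) out of a text field. The sample strings and the pattern are illustrative assumptions; ustrregexm and ustrregexs require Stata 14 or later.

```stata
* Sketch only: extract the first 19xx/20xx year from free text
clear
input str40 text
"Annual report for fiscal year 2016"
"No year mentioned here"
"Filed 10-K on 03/15/2009"
end

* ustrregexm() tests for a match; ustrregexs(0) returns the matched text
generate str4 year = ustrregexs(0) if ustrregexm(text, "(19|20)[0-9]{2}")
list
```

The same pattern works unchanged in most other languages, which is the point made above about regular expressions being language independent.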


Stata commands to change variable names or values of string variables to all lowercase

Stata is a case-sensitive application, which sometimes causes trouble. So, we may want to change variable names or values of variables to all lowercase before we start processing data. This post gives a fast way to do this.

Change variable names to all lowercase

We need to use the command rename. Instead of renaming variables one at a time, we can rename all variables in a single command (thanks Steve):
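
A minimal sketch using Stata's bundled auto dataset. The names are first forced to uppercase only so that there is something to undo; the lower option of rename requires Stata 12 or later:

```stata
* Sketch only: rename every variable to lowercase in one command
sysuse auto, clear

* Force uppercase names first, purely to set up the demonstration
rename *, upper

* Convert all variable names to lowercase
rename *, lower
describe
```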

A related post can be found here:

Change values of string variables to all lowercase

ustrlower(string_variable) or strlower(string_variable) will do the trick. Instead of applying ustrlower or strlower to string variables one by one, we can lowercase the values of all string variables in a short loop. The following loop first checks the type of each variable; if it is a string variable, the loop changes its values to all lowercase.
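
A sketch of such a loop, again on the bundled auto dataset (where make is the only string variable). ustrlower requires Stata 14 or later; strlower can be substituted on older versions:

```stata
* Sketch only: lowercase the values of every string variable
sysuse auto, clear

foreach v of varlist _all {
    * confirm sets _rc to 0 only when `v' is a string variable
    capture confirm string variable `v'
    if !_rc {
        replace `v' = ustrlower(`v')
    }
}
list make in 1/5
```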



Common commands to deal with date in Stata

* eom() is provided by the user-written egenmore package (ssc install egenmore)
egen compdatadate=eom(fiscalmonth fiscalyear)
format compdatadate %td
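
A few more date operations, sketched with built-in functions only. The two-row dataset and its variable names are hypothetical; the last step shows one way to get a month-end date without egenmore:

```stata
* Sketch only: a tiny hand-typed dataset with hypothetical variable names
clear
input str10 datestr fiscalyear fiscalmonth
"2015-12-31" 2015 12
"2016-02-29" 2016 2
end

* Convert a string to a Stata daily date and display it as a date
generate datadate = date(datestr, "YMD")
format datadate %td

* Extract calendar components
generate yr = year(datadate)
generate qtr = quarter(datadate)

* Month-end date: first day of the following month, minus one day
generate monthend = dofm(ym(fiscalyear, fiscalmonth) + 1) - 1
format monthend %td
list
```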

To be continued …


Stata command to order tabulation result with only top values shown

The tabulate varname command is handy in Stata, but it sometimes returns too long a result if varname contains too many unique values.

The third-party command groups solves the problem by showing top values only. Please use ssc install groups to install groups. The usage of groups is very similar to tabulate. Here are some examples:
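
A sketch on the bundled auto dataset. The order() and select() options are written as I recall them from the groups help file, so please check help groups for the exact syntax in your installed version:

```stata
* Sketch only: requires groups (ssc install groups)
sysuse auto, clear

* Plain usage, similar to tabulate
groups rep78

* Sort by frequency, highest first, and show only the top 3 values
groups rep78, order(high) select(3)
```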



Empower “and” and “or” in IF statement in Stata

Compared to SAS, Stata is a little awkward when combining conditions with and and or in an if statement. For example:

In SAS, we can write if 2001 <= fyear <= 2010. But in Stata, we usually write: if fyear >= 2001 & fyear <= 2010.

In fact, Stata provides a handy inrange function. The above if statement can be written as: if inrange(fyear, 2001, 2010).

Similarly, Stata provides an inlist function. The syntax is inlist(z, a, b, ...), which returns 1 if z == a or z == b … In an if statement, it is equivalent to if z == a | z == b | ...
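
A quick check of both equivalences on the bundled auto dataset, using rep78 in place of fyear (my choice, for illustration only):

```stata
* Sketch only: inrange() and inlist() versus the long-hand conditions
sysuse auto, clear

* inrange(): same observations as the two-sided comparison
count if inrange(rep78, 3, 5)
scalar n_inrange = r(N)
count if rep78 >= 3 & rep78 <= 5
assert r(N) == n_inrange

* inlist(): same observations as chained equality tests
count if inlist(rep78, 1, 2)
scalar n_inlist = r(N)
count if rep78 == 1 | rep78 == 2
assert r(N) == n_inlist
```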
