Link FactSet and CRSP

Both FactSet and CRSP have identifier files that contain primary identifiers at the entity level and the security level (note: an entity may issue multiple securities) and provide a map between their primary identifiers and all other historical identifiers (e.g., CUSIP, Ticker). We can use these identifiers to construct a link table.

Relevant CRSP Variables

CRSP STOCKNAMES file provides a map between CRSP permanent identifiers and all historical CUSIPs, Company Names, and Exchange Tickers, along with their effective date ranges.

  1. PERMCO is the unique identifier for CRSP entities. PERMCO does not change over time even if the company changes its name, CUSIP and exchange ticker.
  2. PERMNO is the unique identifier for CRSP securities. PERMNO does not change over time too.
  3. COMNAM is the company name in CRSP at a specific point in time.
  4. NCUSIP is the 8-digit historical CUSIP for CRSP securities. A change in CUSIP, for example, could be triggered by any change in the security, including non-fundamental events such as splits and company name changes. CRSP also has a header variable named “CUSIP” that only reflects the most recent CUSIP for the entire time series.
  5. NAMEDT and NAMEENDDT are the first and last effective dates of each set of name structure, which consists of CUSIP, Company Name, Exchange Code, Exchange Ticker, Share Class, and SIC Code. Each PERMNO has at least one name structure in STOCKNAMES file.

Relevant FactSet Variables

FactSet SYM_V1_SYM_CUSIP_HIST file provides a map between FactSet permanent identifiers and all historical CUSIPs, along with their effective date ranges.

  1. FSYM_ID is the unique identifier for FactSet securities (similar to PERMNO in CRSP). FSYM_ID does not change over time. In another identifier file, ENT_V1_ENT_SCR_SEC_ENTITY_HIST, FactSet provides a map between FactSet securities and their issuing entities (uniquely identified by FACTSET_ENTITY_ID, which does not change over time like PERMCO in CRSP).
  2. CUSIP is the 9-digit historical CUSIP for FactSet securities. FactSet also has a header variable named “MOST_RECENT” that only reflects the most recent CUSIP for the entire time series.
  3. START_DATE and END_DATE are the first and last effective dates of each set of name structure.

CUSIP Method to Link FactSet FSYM_ID and CRSP PERMNO

Both FactSet FSYM_ID and CRSP PERMNO are used to uniquely identify securities. We can link the two using the associated historical CUSIPs, the common identifier in both FactSet and CRSP.

The following two-part code generates a link table that provides a map between FactSet FSYM_ID and CRSP PERMNO, along with their effective date ranges (START_DATE AND END_DATE). The first part uses SAS, and the second uses Stata. An SAS expert should be able to complete the task in SAS entirely. I use Stata for the second part simply because a handy command (tsspell) is available in Stata.

First-part SAS code:

Second-part Stata code:

Please set up LIBNAME and directories in SAS and Stata to reference to relevant datasets properly. Please use ssc install tsspell, replace in Stata to install the third-party command.

Please note END_DATE in the link table is the most recent update date of CRSP STOCKNAMES file (rather than a missing value) if the link is still active. Therefore, please use the most recent CRSP STOCKNAMES file to avoid any unexpected error.

Perhaps because CRSP only keeps historical CUSIP data when the securities are traded in major exchanges, whereas FactSet extracts data from a much large security universe, I find that the link table only covers a small portion of the FactSet universe – only 36,560 FSYM_IDs can be linked to a PERMNO whereas there are 310,234 FSYM_IDs in FactSet with FACTSET_ENTITY_IDs.

Thanks to two WRDS articles: Merging CRSP and Compustat Data and Linking IBES and CRSP Data.


Posted in Data | 7 Comments

A test on Stata running speed on MacBook Pro (M1 Pro chip) and old Macs

I ran a test on Stata running speed on my newest MacBook Pro (14-inch, 2021) and two old Macs—iMac (27-inch, 2019) and MacBook Pro (16-inch, 2019).

Technical specifications:

  • MacBook Pro (14-inch, 2021): CPU Apple M1 Pro (10-core), memory 16G, SSD 1T
  • iMac (27-inch, 2019): CPU Intel i9-9900K (3.6GHz 8-core), memory 64G, SSD 1T
  • MackBook Pro (16-inch, 2019): CPU Intel i9-9880H (2.3 GHz 8-core), memory 16G, SSD 1T

The Stata code I tested includes some simple commands only, such as loop, egen, and merge. As I’m using Stata SE, the single-core performance of the three Macs should be the most important determinant.

MacBook Pro (2021) took 65.71 seconds to complete the test, while iMac (2019) took 96.59 seconds and MacBook Pro (2019) 117.16 seconds.

I’m surprised to see this result. The M1 Pro chip seems a beast!


Posted in Stata | 6 Comments

My thoughts on Python for accounting research

There is a temptation for accounting PhD students to invest in learning Python. However, I would recommend accounting PhD students focus more on SAS + Stata than on Python in their first year for a few practical and technical reasons:

  1. Although Python looks trendy, still few accounting researchers master this language. Research is oftentimes a co-authorship. If the research code can only be understood by one person in a research team, it will be counter productive.
  2. Because not many researchers use Python yet, Python users cannot take advantage of many ready-to-use code written in SAS which has a larger user base.
  3. Debugging in Python is harder than in commercial software such as SAS and Stata. Freeware sounds appealing, but I always find I end up spending more time when using freeware which typically lacks detailed and excellent help documentation.
  4. Like Stata, at least for now, Python cannot manipulate large datasets (limited by the memory size of computer). If you are going to handle mega datasets such as intra-day bond/stock transaction data, Python will be a pain.

SAS + Stata is a solid solution I advocate. I use SAS for SQL and Stata for everything else. Python is suitable for certain research topics such as textual analysis. If you decide to invest in learning Python, you can count on the following learning resources:

  1. The Python Tutorial on the official Python website: – easy for a fast read.
  2. There are tons of Python courses on Udemy: – year-round discounted price is $13 per course.
  3. Anand et al. authored an excellent methodology article – Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020 isbn), “Using Python for Text Analysis in Accounting Research (forthcoming)”, Foundations and Trends ® in Accounting: Vol. xx, No. xx, pp 1–18. DOI: 10.1561/XXXXXXXXX.
Posted in Learning Resources, Python | 1 Comment

Use Stata to do propensity score matching (PSM)

Most propensity score matching (PSM) examples are using cross-sectional data instead of panel data. However, accounting research often uses panel data (i.e., observations with two subscripts i and t, e.g. firm-years) in a difference-in-differences (DID) research design, so that there are two dummy variables, TREATMENT and POST, in the following regression:


where TREATMENT often indicates an event and POST indicates before or after that event. It is common that we do a one-to-one matching, and it arguably makes more sense that such one-to-one matching is done by using selected pre-event and firm-level variables (Xs). The pre-event variables can be measured either at the most recent date before the event (e.g., the total assets at the most recent quarter end before the event) or at the average over the pre-event period (e.g., the average total assets in the four quarters preceding the event).

We need to do a probit or logit regression for PSM:

TREATMENT = X1 + X2 + …

The single nearest neighbour in terms of propensity score will be selected as the matched control, and then DID regressions can be done subsequently.

psmatch2 is a user-written module to find out matched controls using PSM. First, we need to install the module in Stata by typing:

Then the following command should work in most cases:

There are three options in the above command:

  • noreplacement – perform one-to-one matching without replacement. I would add this option to find more unique matched controls.
  • logit – use logit instead of the default probit to estimate the propensity score. I am indifference on this option
  • descending – more details about this option can be found in Lunt (2014). The author concludes that “in the absence of a caliper (another option I would omit to maximize matched controls), the descending method provides the best matches, particularly when there is a large separation between exposed (treated) and unexposed (untreated) subjects.” So, I would add this option.

psmatch2 creates a number of variables, of which the following two are the most useful for subsequent DID regressions:

  • _id – In the case of one-to-one and nearest-neighbors matching, a new identifier created for all observations.
  • _n1 – In the case of one-to-one and nearest-neighbors matching, for every treatment observation, it stores the new identifier (_id) of the matched control observation.

There is a limitation with psmatch2. Sometimes we may want the treatment and its matched control to have the same value on a variable X. For example, we may want the treatment and its matched control to be drawn from the same industry, or both to be male or female. psmatch2 seems incapable on this. Some imperfect solutions are discussed in this post (i.e., adding i.INDUSTRY or i.GENDER in Xs). In contrast, the PSMATCH procedure in SAS seems to have a perfect solution by providing the EXACT= statement (although I don’t know if SAS implements a stratification method. If yes, psmatch2 can also do so by tweaking its options.) More details about the SAS procedure can be found in this manual.

Another conclusion is that psmatch2 is preferable to Stata’s built-in command teffects, because we need the variables generated by psmatch2 (e.g., _id and _n1) for subsequent DID regressions, while teffects do not return such variables.

This article aims at providing a quick how-to and thus ignore some necessary steps for PSM, such as assessing covariate’s balance. More rigorous discussion on PSM in accounting research can be found in Shipman, Swanquist, and Whited (2017).

I benefit from the following articles and Thanks to both authors:

Posted in Stata | 4 Comments

Export a SAS dataset to Stata with all variable names converted to lowercase

I use both SAS and Stata and often need to transfer data between the two. SAS is case-sensitive and Stata is not. I always prefer working with lowercase variable names in Stata. The following code is used to export a SAS dataset to Stata with all variables names converted to lowercase.

The macro I use is borrowed from Adrian’s work. Thanks Adrian.

A related post can be found here:


Posted in SAS | 3 Comments

Clean up TRACE Enhanced dataset

WRDS provides an excellent manual (link) and SAS code (link) for cleaning up the raw TRACE Enhanced bond transaction data, primarily based on the work done by Dick‐Nielsen, Jens, How to Clean Enhanced TRACE Data (December 3, 2014). Available at SSRN: Dick‐Nielsen also provides his SAS code for the clean-up. Several papers refer to his cleaning steps.

Both WRDS and Dick-Nielsen’s codes remove cancellations, corrections, reversals, and double counting of agency trades. Dick-Nielsen’s code provides a few more options, e.g., remove commissioned trades.

Posted in SAS | Leave a comment

Stata command to perform Chow test

A Chow test is simply a test of whether the coefficients estimated over one group of the data are equal to the coefficients estimated over another.

I find two useful articles from Stata’s official website:

Can you explain Chow tests?
How can I compute the Chow test statistic?

Suppose we do following regressions separately in two groups:

regress y x1 x2 if group==1 and regress y x1 x2 if group==2

Then following commands will test the equality of coefficients on x1 and x2:

ge g2=(group==2)
regress y c.x1##i.g2 c.x2##i.g2
contrast g2 g2#c.x1 g2#c.x2, overall

Stata’s official website gives an example of the output:

In this example, to test the equality of coefficients on x1 and x2, 6.06 and 2.80 are the F-stats that we are looking for.

Posted in Stata | 5 Comments

SAS macro to count the number of analysts following a firm

This macro is used to count the number of analysts who followed a specific firm. Although this is a commonly used measure in literature, prior studies often give a vague description on what they do. The question is—what does “analysts following a firm” really mean?

First, it is only meaningful to count the number at a specified date.

Second, how to define “an analyst is actually following a firm”? I use the following definition: if an analyst issued any forecast (EPS or stock price or sales, anything) within a certain window (e.g., 180 days) before the specified date, then the analyst will be counted in. This definition ensures that the analyst is “actively” following the firm.

That is why my macro requires two arguments: DATE and WINDOW. This macro is used to answer such question—at a specified date, how many analysts are actively following Firm A, B, …?


Posted in SAS | 1 Comment

The art of regular expression

Regular expression is a powerful tool to do text search. It is the foundation of a lot of textual analysis research, though today’s textual analysis in computer science has gone far beyond text search. Regular expression operations are programming language independent. Any modern programming language supports regular expression operations well. So, if someone tells you that PERL is the best language to do text search (or textual analysis), that is plainly wrong.

Writing regular expression is work of art! You can find building blocks of regular expression here. I create this post to gather examples of regular expression that will solve certain text search questions. I will grow this post continuously.

Posted in Python | 1 Comment

Stata commands to change variable names or values of string variables to all lowercase

Stata is a case-sensitive application. Sometimes this will cause a trouble. So, we may want to change variable names or values of variables to all lowercase before we start processing data. This post gives a fast way to do this.

Change variable names to all lowercase

We need to use the command rename. Instead of renaming variables one at a time, we can rename all variables in a single command (thanks Steve):

A related post can be found here:

Change values of string variables to all lowercase

ustrlower(string_variable) or strlower(string_variable) will do the trick. Instead of applying ustrlower or strlower function to string variables one by one, we can benefit from lowercasing values of all string variables in a short loop. The following loop will first check the type of a variable. If it is a string variable, then change the value of the variable to all lowercase.


Posted in Stata | 2 Comments