EDGAR index files in Stata dataset (from 1993 Q1 to March 2, 2017)

The SEC makes all EDGAR filings publicly available, so we can download every 10-K, 10-Q, and 8-K filed since 1993. However, bulk downloading takes far more than a few mouse clicks (I guess this is meant to reduce server load and prevent abuse). To download EDGAR filings, we first have to download the EDGAR index files, which give the full path of each 10-K, 10-Q, 8-K, etc. We cannot download any filing without its full path. See technical details here.

I downloaded all EDGAR index files and converted them into Stata datasets. You can download them here: Stata format (1993–2000); Stata format (2001–2005); Stata format (2006–2010); Stata format (2011–2015); Stata format (2016–2019/03/16).

If you want to know how I did this, please read my other post here.

Posted in Data | 16 Comments

Link RSSD with PERMCO

If you work with bank holding company data, such as the FR Y-9C, you may need to link the unique identifier in that data (RSSD) to the unique identifier in CRSP (PERMCO).

The Federal Reserve Bank of New York provides such a link table.

Posted in Data | 7 Comments

My favorite econometrics textbook

The website http://www.econometricsbooks.com discusses popular econometrics textbooks. I find the information very useful.

My favorite textbook is Introductory Econometrics: A Modern Approach. I would call it an intermediate-level textbook. It discusses almost every topic that we use in accounting research. I also like Basic Econometrics. Although it is an introductory-level textbook, it explains the intuition behind the theory very well, which I find very helpful.

Of course, Econometric Analysis is a classic. But I would categorize it as “intimidating” level. My professor told me that he had read through it many times, and I just said “Wow”.

Posted in Learning Resources

Use Python to download TXT-format SEC filings on EDGAR (Part I)

[Update on 2019-07-31] This post, together with its sibling post “Part II”, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and the text-format filings are now extremely unfriendly for researchers. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other post deserves more attention.

[Update on 2018-10-06] As I acknowledged in the very first edition of this post, I borrowed some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Edouard kindly informed me that he had updated his module (see his GitHub page). The major updates to his module are: (1) he migrated the file download from FTP to HTTPS, and (2) he added parallel downloads, so it is now faster to rebuild the full index, especially when going all the way back to 1993. My initial impression of his updated module is that it provides more flexibility and should be more robust than mine. Thank you, Edouard, for your work!

[Update on 2017-03-03] The SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. So the description of the FTP server in the original post no longer applies (but the basic idea about the URLs to raw text filings remains unchanged). Since then I have received several requests to update the script. Here is the new script for Part I.

The technical details may be too boring for most people. So, I provide multiple downloadable Stata datasets that include all index files from 1993 Q1 to October 6, 2018.

Stata format (1993–2000); Stata format (2001–2005); Stata format (2006–2010); Stata format (2011–2015); Stata format (2016–2019/03/16)

[Original Post] We know that the SEC makes company filings (e.g., 10-Ks, 10-Qs, and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. The SEC provides an anonymous EDGAR FTP server for accessing raw text filings. Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or an FTP client to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:

ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx

By copying the path into an Internet browser or an FTP client, we can download the file directly.
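Because the FTP server has since been closed (see the 2017 update above), the sketch below shows one way the quarterly index paths might be enumerated over HTTPS. The base URL and the function name `quarterly_index_urls` are my assumptions for illustration; they are not taken from the original script.

```python
# Assumption: HTTPS location of the quarterly index files. This base URL is
# not given in the post; adjust it if the SEC changes its layout.
BASE_URL = "https://www.sec.gov/Archives/edgar/full-index"


def quarterly_index_urls(start_year, end_year, end_quarter=4, filename="master.idx"):
    """Return one index URL per quarter, from start_year Q1 to end_year/end_quarter."""
    urls = []
    for year in range(start_year, end_year + 1):
        # Only the final year may stop before Q4.
        last_quarter = end_quarter if year == end_year else 4
        for quarter in range(1, last_quarter + 1):
            urls.append(f"{BASE_URL}/{year}/QTR{quarter}/{filename}")
    return urls


urls = quarterly_index_urls(2015, 2015)
print(urls[-1])  # the 2015 Q4 master.idx, mirroring the FTP path shown above
```

Each URL in the list can then be fetched with any HTTP client. Keeping the list-building separate from the downloading makes it easy to restart after a failed quarter.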

In the above example, we can find the path to “master.idx” by navigating on the EDGAR FTP server. But we cannot find any path to any raw text filing. In other words, paths to raw text filings are not visible by simply looking into the EDGAR FTP server. SEC purposely hides paths to raw text filings to reduce server load and avoid data abuse.

In order to download SEC filings on EDGAR, we have to:

  1. Find paths to raw text filings;
  2. Select what we want and bulk download raw text filings from the EDGAR FTP server using paths we have obtained in the first step.

This post describes the first step, and I elaborate on the second step in another post.

The SEC stores all path information in index files. See technical details here. Let’s take a snapshot of an index file:

The last field on each line in the main body of the index file shows the path to an actual raw text filing. What we have to do in the first step is download and parse all index files and write their content into a database. Then, in the second step, we can run any query against the database (e.g., select a certain form type or a certain period of time) and download raw text filings using the selected paths.
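To make the parse-then-query idea concrete, here is a minimal sketch that uses only Python's built-in `sqlite3` module. It assumes the pipe-delimited layout of “master.idx” (CIK|Company Name|Form Type|Date Filed|Filename). The sample records, the table name `idx`, and the column names are made up for illustration and are not necessarily those used in my program.

```python
import sqlite3

# Hypothetical sample in the pipe-delimited layout assumed for master.idx.
# The records below are made up for illustration.
SAMPLE_INDEX = """\
Description:           Master Index of EDGAR Dissemination Feed
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000001|EXAMPLE CORP|10-K|2015-11-20|edgar/data/1000001/0001000001-15-000010.txt
1000002|SAMPLE HOLDINGS INC|8-K|2015-12-01|edgar/data/1000002/0001000002-15-000020.txt
1000001|EXAMPLE CORP|10-Q|2015-10-30|edgar/data/1000001/0001000001-15-000005.txt
"""


def parse_index(text):
    """Return (cik, company, form_type, date_filed, path) tuples from index text."""
    records = []
    for line in text.splitlines():
        fields = line.strip().split("|")
        # Data rows have exactly five fields and a numeric CIK; this skips
        # the header lines, the column-name row, and the dashed separator.
        if len(fields) == 5 and fields[0].isdigit():
            records.append(tuple(fields))
    return records


def load_into_sqlite(conn, records):
    """Create the (hypothetical) idx table if needed and insert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idx "
        "(cik TEXT, conm TEXT, form_type TEXT, date_filed TEXT, path TEXT)"
    )
    conn.executemany("INSERT INTO idx VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, parse_index(SAMPLE_INDEX))

# Step two starts with a query such as: paths of all 10-Ks filed in 2015 Q4.
rows = conn.execute(
    "SELECT path FROM idx WHERE form_type = ? AND date_filed BETWEEN ? AND ?",
    ("10-K", "2015-10-01", "2015-12-31"),
).fetchall()
print(rows)
```

Storing the dates as ISO-formatted text keeps the `BETWEEN` comparison correct without any date parsing.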

I wrote the following Python program to execute the first step. This program borrows some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Please see his package information page here.

Please note: my program stores all paths in an SQLite database. I personally like this lightweight database product very much. The last few lines of my program transfer data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, pandas and sqlalchemy, which you have to install yourself using the pip command. Please consult the documentation of SQLite, pandas, and SQLAlchemy if you have installation problems. I am using Python 3.x in all my Python posts.
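For readers who only want the SQLite-to-Stata step, a rough sketch follows. It is not the code from my program: it passes a plain `sqlite3` connection to pandas instead of going through sqlalchemy, and the file names, table name `idx`, and column names are hypothetical.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def sqlite_table_to_stata(db_path, table, dta_path):
    """Copy one SQLite table into a Stata .dta file and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        # `table` is trusted here; never build SQL from untrusted input.
        df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
    finally:
        conn.close()
    # write_index=False keeps the pandas row index out of the Stata file.
    df.to_stata(dta_path, write_index=False)
    return len(df)


# Tiny made-up database so the sketch runs on its own.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "edgar_idx.db")
dta_path = os.path.join(workdir, "edgar_idx.dta")

conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE idx "
    "(cik TEXT, conm TEXT, form_type TEXT, date_filed TEXT, path TEXT)"
)
conn.execute(
    "INSERT INTO idx VALUES (?, ?, ?, ?, ?)",
    ("1000001", "EXAMPLE CORP", "10-K", "2015-11-20",
     "edgar/data/1000001/0001000001-15-000010.txt"),
)
conn.commit()
conn.close()

n_rows = sqlite_table_to_stata(db_path, "idx", dta_path)
print(n_rows, "row(s) written to", dta_path)
```

With the full index, the table is large, so exporting one range of years at a time (as in the downloadable datasets above) keeps each Stata file manageable.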

I found two articles that explain how to use R and Perl to achieve the same functionality. I include the links (R or Perl) for users who are more comfortable with R or Perl.

Posted in Data, Python | 67 Comments

Which is better for archival accounting research, SAS or Stata?

The short answer is: we need both.

When we do archival accounting research, we often need to merge data across various databases, such as COMPUSTAT, CRSP, and EXECUCOMP. Stata can by no means beat SAS in this regard. SAS supports full SQL (a language designed specifically for querying databases), whereas Stata only has a “baby” merge functionality (really, “baby”!).

Once we have all the data and start the analysis, such as descriptive statistics, correlations, and regressions, the work can be done much more efficiently in Stata than in SAS. Not to mention the many handy packages developed by the Stata user community, an ecosystem much like Apple’s App Store.

So, the best strategy is to use different software at different research stages. The learning cost is not as high as it sounds: Stata is easy to learn (in fact, far easier than SAS). Anyone can pick up Stata very quickly.

I have attached a PPT that covers this topic in a bit more detail (BBLG SAS vs STATA).

Finally, I use the following quote to conclude (thanks to WRDS.US Tutorials Series):

Even though data management and regression can be performed in SAS, some users prefer to use another package to do the ‘final’ steps. For example, SAS can be used to retrieve and manage the data. The final dataset created in SAS can then be converted for example to STATA format (using StataTrans). STATA can then be used to create the tables with descriptive statistics, correlation tables, and perform (final) regressions and other statistical tests.

Posted in Learning Resources, SAS, Stata | 2 Comments