Sample code for “outreg” command in Stata

outreg is a time-saving, must-have command in Stata. It generates a ready-to-use results table, and I’m sure you will appreciate what a relief that is.

outreg is not a built-in command and can be installed by issuing the following command:

ssc install outreg

The typical usage of outreg is:
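A minimal sketch of the workflow (the output file name and the replace option here are only for illustration; see the help file for the exact syntax):

regress y x1 x2 x3 x4
outreg using results.doc, keep(x1 x2 x3 x4) stats(b se) replace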

x1, x2, x3, and x4 are the independent variables whose estimates you want to report. Sometimes you may not want to report unimportant independent variables; the keep() option lets you list only the ones you care about. If you include interaction terms in a regression like this:

regress y c.x1##c.x2

By the way, this is the highly recommended way to include interaction terms in regressions. If you want to report estimates for x1, x2, and x1 × x2, please use the following keep option:

keep(_cons x1 x2 c.x1#c.x2)

I list other frequently used options that control the appearance of the results table:

stats: choose from b, se, t, and p. Usually b plus one other statistic is reported.

      • b: coefficient estimates
      • se: standard errors of the estimates
      • t: t statistics for the test of b = 0
      • p: p-values of the t statistics

nosubstat: I do not include this option in the sample code. This option reports the two selected statistics (usually b and one other) in two separate columns, rather than placing one (e.g., the t statistic) below the other (e.g., b).

varlabels: I do not include this option in the sample code. This option reports variable labels rather than variable names in the generated table. Variable labels are sometimes easier to understand than variable names; in that case, set labels for your independent variables and turn on this option.

nosubstat and varlabels can be placed anywhere after the comma.
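For example, a call that reports coefficients and t statistics side by side and labels rows with variable labels might look like this (again a sketch; check the help file for the exact syntax):

outreg using results.doc, stats(b t) nosubstat varlabels replace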

The complete help file for this command can be found here.

There is a similar command, outreg2, that you can check out, but I find outreg good enough and it works really well for me.

Posted in Stata | Leave a comment

Use Python to download data from the DTCC’s Swap Data Repository

I helped a friend download data from the DTCC’s Swap Data Repository. I am not familiar with the data and treated this mainly as a programming exercise.

This article gives an introduction to the origin of the data: http://www.dtcc.com/news/2013/january/03/swap-data-repository-real-time

The Python script will:

  1. download the daily Credit zip files; and
  2. extract the CSV from each zip file and combine the contents into a single huge CSV (about 220 MB), which can then be imported into Stata or another statistical package.

As of April 22, 2016, there were around one million historical records. The data appears to be available from April 6, 2013, with sporadic gaps after that. The Python script prints the bad dates on which daily data is not available.
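The script is not reproduced here, but a minimal sketch of the idea is below. The BASE_URL pattern is a placeholder (substitute the actual address of the DTCC SDR daily Credit slices), the output file name is arbitrary, and the requests package must be installed.

import csv
import io
import zipfile
from datetime import date, timedelta

import requests

BASE_URL = 'https://example-sdr-host/slices/CREDITS_{}.zip'   # placeholder URL pattern

start, end = date(2013, 4, 6), date(2016, 4, 22)
bad_dates = []

with open('credits_combined.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    header_written = False
    day = start
    while day <= end:
        resp = requests.get(BASE_URL.format(day.strftime('%Y_%m_%d')))
        if resp.status_code != 200:
            bad_dates.append(day)                  # daily data not available
        else:
            with zipfile.ZipFile(io.BytesIO(resp.content)) as z:
                name = z.namelist()[0]             # each daily zip holds one CSV
                rows = csv.reader(io.TextIOWrapper(z.open(name), encoding='utf-8'))
                header = next(rows)
                if not header_written:             # write the header only once
                    writer.writerow(header)
                    header_written = True
                writer.writerows(rows)             # append this day's records
        day += timedelta(days=1)

print('Bad dates:', [d.isoformat() for d in bad_dates])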

 

Posted in Data, Python | Leave a comment

Download FR Y-9C data from WRDS

WRDS currently provides FR Y-9C data quarter by quarter in individual datasets, such as BHCF200803, BHCF200806, BHCF200809, and so on. WRDS has not stacked those individual datasets into a single time-series dataset the way COMPUSTAT is organized.

There are two ways to overcome this:

  1. Use the web query on WRDS. The web query allows users to specify a date range and returns a single time-series dataset.
  2. Use the SAS script I wrote, which is equivalent to the web query but easier to update in the future. The code currently accepts a date range and downloads selected variables; a sketch of the stacking logic appears after this list.
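To give a flavor of the stacking logic, here is a minimal sketch (not the actual script). The library path and the variable list are assumptions that you will need to adjust, and a full version would build the list of quarterly datasets from the date range, e.g., with a macro loop.

libname bhc '/wrds/bank/sasdata';   /* point this at the FR Y-9C library on WRDS */

data y9c;
    set bhc.bhcf200803 bhc.bhcf200806 bhc.bhcf200809 bhc.bhcf200812;   /* one dataset per quarter */
    keep rssd9001 rssd9999 bhck2170;   /* entity ID, report date, total assets */
run;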

 

Posted in Data, SAS | 2 Comments

TAR-Style Word Template

I created a Word template that complies with The Accounting Review’s editorial style. My design philosophy is “simple but sufficient”: I do not like templates that are heavy and fancy (e.g., macros everywhere).

This is just version 1. It is quite usable though. Download here.

Good luck to everyone who tries to publish a paper in The Accounting Review!!!

PS: I have lost my love for MathType. It drives me crazy by converting my equations into un-editable graphics over and over again. I have started using Word’s built-in Equation Editor, but Microsoft apparently cannot make the font look right. Install the STIX math font if you are as picky as I am; STIX is a math font that makes equations in Word look a lot like Times New Roman. Just google “STIX math font”.

Posted in Learning Resources | 2 Comments

Use Python to download TXT-format SEC filings on EDGAR (Part II)

[Update on 2019-07-31] This post, together with its sibling post “Part I”, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and text-format filings are extremely unfriendly to researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other post deserves more attention.

[Update on 2017-03-03] SEC closed the FTP server permanently on December 30, 2016 and started to use a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here are the new codes for Part II.

[Original Post] As I said in the post entitled “Part I”, we have to do two steps in order to download SEC filings on EDGAR:

  1. Find paths to raw text filings;
  2. Select what we want and bulk download from EDGAR using paths we have obtained in the first step.

“Part I” elaborates on the first step. This post shares the Python code for the second step.

In the first step, I saved the index files in a SQLite database as well as in a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the filing types you want and export your selection to a CSV file, say “sample.csv”. To use the following Python code, the format of the CSV file must look as follows (this example selects all 10-Ks of Apple Inc). Please note: both the SQLite and Stata datasets contain an index column, and you have to delete that index column when exporting your selection to the CSV file.

Then we can let Python complete the bulk download task:
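A minimal sketch of this step (not the original script) might look like the following. It assumes the columns of sample.csv are, in order, the CIK, company name, form type, filing date, and the EDGAR path of each filing, and that the requests package is installed.

import csv

import requests

with open('sample.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                                           # skip the header row
    for line in reader:
        url = 'https://www.sec.gov/Archives/' + line[4]    # path column from the index
        saveas = '-'.join([line[0], line[2], line[3]])     # cik-form type-filing date
        # EDGAR expects a descriptive User-Agent for automated requests
        resp = requests.get(url, headers={'User-Agent': 'Your Name you@example.com'})
        with open(saveas + '.txt', 'wb') as out:
            out.write(resp.content)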

The code does not handle the file directories of “sample.csv” or of the output raw text filings; you can modify that yourself. saveas = '-'.join([line[0], line[2], line[3]]) is used to name the output SEC filings. The current name is cik-form type-filing date.txt. Rearrange these elements to suit your needs (thanks to Eva for letting me know about a previous error here).

Posted in Data, Python | 59 Comments

Use Python to extract Intelligence Indexing fields in Factiva articles

First of all, I acknowledge that I benefited a lot from Neal Caren’s blog post Cleaning up LexisNexis Files. Thanks, Neal.

Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.

Data Elements in Factiva Article

Each news article in Factiva, no matter how it looks, contains a number of data elements. In Factiva’s terminology, those data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:

Field Label | Field Name | What It Contains
HD | Headline | Headline
CR | Credit Information | Credit Information (Example: Associated Press)
WC | Word Count | Number of words in document
PD | Publication Date | Publication Date
ET | Publication Time | Publication Time
SN | Source Name | Source Name
SC | Source Code | Source Code
ED | Edition | Edition of publication (Example: Final)
PG | Page | Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing term)
LA | Language | Language in which the document is written
CY | Copyright | Copyright
LP | Lead Paragraph | First two paragraphs of an article
TD | Text | Text following the lead paragraphs
CT | Contact | Contact name to obtain additional information
RF | Reference | Notes associated with a document
CO | Dow Jones Ticker Symbol | Dow Jones Ticker Symbol
IN | Industry Code | Dow Jones Intelligent Indexing Industry Code
NS | Subject Code | Dow Jones Intelligent Indexing Subject Code
RE | Region Code | Dow Jones Intelligent Indexing Region Code
IPC | Information Provider Code | Information Provider Code
IPD | Information Provider Descriptors | Information Provider Descriptors
PUB | Publisher Name | Publisher of information
AN | Accession Number | Unique Factiva.com identification number assigned to each document

Please note that not every news article contains all of those data elements, and that the table may not list every data element used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, you may not be able to see certain data elements. But they are there, and Factiva uses them to organize and structure its proprietary news article data.

How to Extract Data Elements in Factiva Article

[Flow diagram: Step 1 download articles from Factiva in RTF format → Step 2 convert RTF to TXT → Step 3 extract data elements and save to a table]

You can follow the three steps outlined in the above diagram to extract the data elements from news articles for further processing (e.g., calculating the tone of the full text, which is the LP and TD elements combined, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one below.

Step 1: Download Articles from Factiva in RTF Format

It is a lot of pain to download a large number of news articles from Factiva: it is technically difficult to download articles in an automated fashion, you can only download 100 articles at a time, and those 100 articles cannot exceed the word-count limit of 180,000. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.

Firstly, you may see some people discuss methods for automatic downloading (a so-called “webscraping” technique; see here). However, this requires more hacking now that Factiva has introduced CAPTCHA to determine whether the user is a human. You may not be familiar with the term “CAPTCHA”, but you have certainly experienced being asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to prevent robotic downloading. Though CAPTCHA is not unbeatable, defeating it requires advanced techniques.

Secondly, the Factiva licence expressly prohibits data mining. However, the licence does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be flagged as data mining. But the threshold speed set by Factiva is low, and any trained and adept person can exceed it easily. If you are flagged by Factiva, things could get ugly. So do not go too fast, even if this slows down your research.

Let’s get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option. Please select the third one, Full Article/Report plus Indexing, as indicated by the following screenshot:

[Screenshot: Factiva display options, with “Full Article/Report plus Indexing” selected]

Then you have to download the articles in RTF – Article Format, as indicated by the following screenshot:

[Screenshot: Factiva download options, with “RTF – Article Format” selected]

After the download is complete, you will get an RTF document. If you open it, you will see that the news articles look like this:

[Screenshot: a downloaded news article as it appears in the RTF document]

The next step is to convert the RTF to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row represents a news article, and each column is a data element.

Step 2: Convert RTF to TXT

Well, this can surely be done with Python, but so far I have not written a Python program to do it; I will fill this gap when I have time. For my research, I simply take advantage of the default text editor shipped with Mac OS, TextEdit: I select Format – Make Plain Text from the menu bar and then save the document in TXT format. You can automate this with Automator in Mac OS.
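If you prefer to skip the clicking altogether, one alternative on Mac OS is to batch-convert the files by calling the built-in textutil utility from Python (a sketch, assuming the RTF documents sit in the current directory):

import glob
import subprocess

for rtf in glob.glob('*.rtf'):
    # textutil ships with Mac OS; this writes a .txt file next to each .rtf file
    subprocess.run(['textutil', '-convert', 'txt', rtf], check=True)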

Step 3: Extract Data Elements and Save to a Table

This is where Python does the dirty work. To run the Python program correctly, save it in the directory containing all the plain TXT documents created in Step 2 before you run it. The program will:

  1. Read in each TXT document;
  2. Extract data elements of each article and write them to an SQLite database;
  3. Export data to a CSV file for easy processing in other software such as Stata.

I introduce an intermediate step that writes the data to an SQLite database simply because this makes it easier to manipulate the news article data in Python for other purposes. Of course, you can write the data directly to a CSV file.
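The program itself is not reproduced here, but a minimal sketch of the idea follows. It assumes that in the TXT documents each Intelligence Indexing field label (HD, PD, SN, LP, TD, AN, and so on) starts a line, that a field’s content runs until the next label, and that AN closes each article; you will probably need to adjust the parsing rules to your own files.

import csv
import glob
import re
import sqlite3

FIELDS = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY',
          'LP', 'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD',
          'PUB', 'AN']
label = re.compile(r'^(' + '|'.join(FIELDS) + r')\b')

conn = sqlite3.connect('factiva.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles ({})'.format(
    ', '.join('"{}" TEXT'.format(f) for f in FIELDS)))

for txt in glob.glob('*.txt'):
    article, current = {}, None
    with open(txt, encoding='utf-8') as f:
        for line in f:
            m = label.match(line)
            if m:
                current = m.group(1)
                article[current] = line[m.end():].strip()
            elif current:
                # continuation line: append it to the field that is currently open
                article[current] = (article[current] + ' ' + line.strip()).strip()
            if current == 'AN' and article.get('AN'):
                # AN is the last field of an article: save the record and reset
                conn.execute(
                    'INSERT INTO articles VALUES ({})'.format(', '.join('?' * len(FIELDS))),
                    [article.get(fld, '') for fld in FIELDS])
                article, current = {}, None
conn.commit()

# export everything to a CSV file for Stata or other software
with open('factiva.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    writer.writerows(conn.execute('SELECT * FROM articles'))
conn.close()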

Posted in Python | 15 Comments

A loop of cross-sectional regressions for calculating abnormal accruals in Stata

I wrote a loop of cross-sectional regressions for calculating abnormal accruals. The program can easily be modified to estimate the Jones, modified Jones, or Dechow and Dichev model.

I add detailed comments in the program to help you prepare the input file.
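To give a sense of the structure, here is a minimal sketch of such a loop (not the full program). It assumes the input file already contains scaled total accruals (tacc), the inverse of lagged assets (inv_at), the change in sales (d_sale), and gross PP&E (ppe), with sic2 and fyear identifying the industry-year groups; the variable names are placeholders.

gen da = .
egen grp = group(sic2 fyear)
quietly summarize grp, meanonly
forvalues i = 1/`r(max)' {
    capture regress tacc inv_at d_sale ppe if grp == `i'
    if _rc == 0 & e(N) >= 10 {                 // require enough observations per group
        tempvar res
        quietly predict `res' if grp == `i', residuals
        quietly replace da = `res' if grp == `i'
        drop `res'
    }
}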

Posted in Stata | 7 Comments

The impact of WRDS transition to the new WRDS Cloud server

WRDS has quietly started the transition from the old server to the new Cloud server. This move makes a lot of the support documentation on the WRDS website outdated and misleading. That is why I think WRDS should direct its resources toward continuously updating tutorials and manuals and providing more ready-to-use research macros and applications, instead of wasting money on website cosmetics as it did recently.

Right now, among the supporting documentation about accessing WRDS, only the following two documents are up to date:

The WRDS Cloud Manual
PC-SAS on the WRDS Cloud

All other documentation contains outdated information and may cause confusion and unexpected problems.

In its support documentation, WRDS refers to the old server as either WRDS Unix Server or WRDS Interactive Server (wrds3). The new server is called WRDS Cloud.

The address of the old server: wrds.wharton.upenn.edu 4016
The address of the new server: wrds-cloud.wharton.upenn.edu 4016

They are DIFFERENT! Users who access WRDS via SSH or PC-SAS will be affected by this transition.

PC-SAS users are familiar with the following statements:
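Roughly, the connection block for the old server looked like this (reconstructed here; adapt it to your setup):

%let wrds = wrds.wharton.upenn.edu 4016;
options comamid=TCP remote=WRDS;
signon username=_prompt_;
rsubmit;
    /* your SAS code runs on the WRDS server here */
endrsubmit;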

PC-SAS users were able to use one of the eight SASTEMP directories on the server to store sizeable data files temporarily, and to upload and download data files to and from their home directory (which would be /home/yourinstitution/youraccountname, with a 750 MB space limit). In addition, if you used SSH to log onto the old server, you would see the same home directory as in PC-SAS. As a result, if you uploaded a data file to your home directory via an easy-to-use SSH file transfer (FTP-like) app, you would be able to find the file in your home directory during PC-SAS sessions.

Now this has changed: since August 25, 2015, PC-SAS connects through the WRDS Cloud instead of the old Interactive Server (wrds3), EVEN IF YOU STILL SPECIFY %let wrds = wrds.wharton.upenn.edu 4016;. The consequences of this change are:

  • You can no longer use the eight SASTEMP directories from PC-SAS. Instead, you can use a larger directory for your temporary data (500 GB shared by your institution), located at /scratch/yourinstitution. You can still access the eight SASTEMP directories if you log onto the old server via SSH.
  • The WRDS Cloud gives you a new home directory, though its path remains /home/yourinstitution/youraccount (with a new 10 GB space limit). So if you use SSH to log onto the old server (as many users probably do if they are not aware of the server transition), you cannot see the files that you create in your home directory during PC-SAS sessions.

These two consequences may confuse users who use both PC-SAS and SSH to access WRDS interchangeably. They may ask: “Why can’t I use the temporary directory any more?” or “Where are my files?”

To avoid any possible problems, use the new WRDS Cloud server consistently, whether through SSH or PC-SAS, from now on. This means that whenever you access WRDS, always use the new server address.

If you use PC-SAS, use the following statements:
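The block is the same as before except for the server address (again a reconstruction; adapt it to your setup):

%let wrds = wrds-cloud.wharton.upenn.edu 4016;
options comamid=TCP remote=WRDS;
signon username=_prompt_;
rsubmit;
    /* your SAS code now runs on the WRDS Cloud */
endrsubmit;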

If you use SSH, use the following command:
ssh youraccountname@wrds-cloud.wharton.upenn.edu

With the new WRDS Cloud server, you use a new command to run your SAS program in background in the SSH command line mode:
qsas yourprogram.sas

You can run multiple SAS programs concurrently this way (up to 5 concurrent jobs). If you prefer to run your SAS programs sequentially, you need to write a SAS wrapper script and submit a batch job. You can find the details here.

You can use qstat to view your currently running jobs and get the job ID. If you change your mind and want to terminate a job, you can type:
qdel yourjobid

WRDS is going to phase out the old server. The new WRDS Cloud is supposed to be more computationally powerful, and it offers users a larger home directory and temporary directory. Therefore, it is time for users to migrate to the new WRDS Cloud server.

Posted in Learning Resources, SAS | 2 Comments

Rolling-window computation in SAS and Stata

SASers often find proc expand plus transformout very useful for rolling-window (or moving-window) computation. Stataers may wonder if there is a counterpart in Stata. The answer is “yes”. The command in Stata is rolling. See the manual below:

http://www.stata.com/manuals13/tsrolling.pdf

The benefits of using rolling in Stata come from two facts:

  • Stata is superior to SAS in dealing with time-series or panel data. After a single command to declare time-series or panel data (tsset), Stata handles gaps in the time series intelligently and automatically. In contrast, SAS users have to check for gaps manually. 90% of the SAS code that uses rolling-window transformations in accounting research does not include such a gap check, which may lead to incorrect inferences.
  • In Stata, rolling can be combined with almost any other command, such as regress. Rolling-window computation in Stata is therefore more flexible.

However, proc expand plus transformout in SAS is insanely faster than rolling in Stata (by “insanely faster”, I mean perhaps millions of times faster). This is truly a deal breaker for Stata.

Therefore, the best solution for rolling-window computation is to use Stata to do the gap checking and filling (tsfill) first, and then use SAS to do the lightning-fast rolling-window computation.
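A minimal sketch of that division of labor, with variable names chosen purely for illustration (a firm identifier permno, a monthly date month, and a return ret). First, in Stata:

tsset permno month
tsfill

Then, in SAS (the data must be sorted by permno and month):

proc expand data=panel out=rolled method=none;
    by permno;
    id month;
    convert ret = ret_ma12 / transformout=(movave 12);   /* 12-month moving average */
run;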

Posted in Learning Resources, SAS, Stata | Leave a comment

SAS macro for event study and beta

There are two macros on the List of WRDS Research Macros that are probably used often: EVTSTUDY and BETA.

I like the first one, written by Denys Glushkov; Denys’ code is always elegant. I don’t like the second one, because I believe it contains non-trivial mistakes and does a lot of unwanted calculation.

Since an event study and a beta calculation are just two sides of the same coin, I wrote the following macro to output both event study results (e.g., CAR) and beta. My macro borrows heavily from Denys’ code but differs in the following ways:

  1. I add beta to the final output. This is the main difference.
  2. Denys uses CRSP.DSIY to generate the trading calendar and market returns. I cannot see why he uses this dataset, and the trouble is that not every institution subscribes to it. Thus, I use the more accessible dataset CRSP.DSI instead (thanks to Michael Shen for bringing this to my attention).
  3. I improve efficiency in generating related trading dates at the security-event level.
  4. I correct several errors in Denys’ macro: (a) his macro does not sort the input dataset by permno and event date, leading to a fatal error later on; and (b) I correct a few dataset/variable references.
  5. Denys’ macro switches off warning and error messages, which is inconvenient for debugging. I change this setting.

All changes are commented with /* CHANGE HERE */. I compared the results (CAR and beta) from my macro with those from a commercial package, EVENTUS (with the help of a friend who has an EVENTUS license). The accuracy of my macro is assured (note: EVENTUS does not take delisting returns into account by default).

Update: WRDS rolled out the event study web inquiry (so-called Event Study by WRDS). I recently checked the accuracy of that product. To my surprise, the accuracy is unsatisfactory, if not terrible.

 

Posted in SAS | 1 Comment