Use Python to download TXT-format SEC filings on EDGAR (Part II)

[Update on 2019-07-31] This post and its sibling post “Part I” have been my most-viewed posts since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly to researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other post deserves more attention.

[Update on 2017-03-03] The SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here is the new code for Part II.

[Original Post] As I said in the post entitled “Part I”, we have to complete two steps to download SEC filings from EDGAR:

  1. Find paths to raw text filings;
  2. Select what we want and bulk download from EDGAR using paths we have obtained in the first step.

“Part I” elaborates the first step. This post shares the Python code for the second step.

In the first step, I save the index files in a SQLite database as well as a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the filing types that you want and export your selection into a CSV file, say “sample.csv”. To use the following Python code, the CSV file must be formatted as follows (this example selects all 10-Ks of Apple Inc). Please note: both the SQLite and Stata datasets contain an index column, and you have to delete that index column when exporting your selection into a CSV file.

Then we can let Python complete the bulk download task:
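
Below is a minimal sketch of such a downloader (not the original script). It assumes the exported CSV has a header row and the columns cik, company name, form type, filing date, and the filing path from the index file, and it uses the third-party requests package:

```python
import csv
import os
import time

import requests

with open("sample.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (remove this line if your CSV has none)
    for line in reader:
        # assumed column order: cik, company name, form type, date, path
        url = "https://www.sec.gov/Archives/" + line[4]
        saveas = "-".join([line[0], line[2], line[3]]) + ".txt"
        if os.path.exists(saveas):          # skip filings already downloaded
            continue
        r = requests.get(url, headers={"User-Agent": "Your Name your@email.com"})
        with open(saveas, "wb") as out:
            out.write(r.content)
        time.sleep(0.2)                     # be gentle with EDGAR
```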

The code does not manage the file directories for “sample.csv” or for the output raw text filings; please modify the paths yourself. The line saveas = '-'.join([line[0], line[2], line[3]]) names each output filing; the current pattern is cik-formtype-filingdate.txt. Rearrange these elements to suit your needs (thanks to Eva for letting me know about a previous error here).

Posted in Data, Python | 59 Comments

Use Python to extract Intelligence Indexing fields in Factiva articles

First of all, I acknowledge that I have benefited a lot from Neal Caren’s blog post Cleaning up LexisNexis Files. Thanks, Neal.

Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.

Data Elements in Factiva Article

Each news article in Factiva, no matter what it looks like, contains a number of data elements. In Factiva’s terminology, those data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:

Field Label | Field Name | What It Contains
HD | Headline | Headline
CR | Credit Information | Credit Information (Example: Associated Press)
WC | Word Count | Number of words in document
PD | Publication Date | Publication Date
ET | Publication Time | Publication Time
SN | Source Name | Source Name
SC | Source Code | Source Code
ED | Edition | Edition of publication (Example: Final)
PG | Page | Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing™ term)
LA | Language | Language in which the document is written
CY | Copyright | Copyright
LP | Lead Paragraph | First two paragraphs of an article
TD | Text | Text following the lead paragraphs
CT | Contact | Contact name to obtain additional information
RF | Reference | Notes associated with a document
CO | Dow Jones Ticker Symbol | Dow Jones Ticker Symbol
IN | Industry Code | Dow Jones Intelligent Indexing™ Industry Code
NS | Subject Code | Dow Jones Intelligent Indexing™ Subject Code
RE | Region Code | Dow Jones Intelligent Indexing™ Region Code
IPC | Information Provider Code | Information Provider Code
IPD | Information Provider Descriptors | Information Provider Descriptors
PUB | Publisher Name | Publisher of information
AN | Accession Number | Unique Factiva.com identification number assigned to each document

Please note that not every news article contains all those data elements, and that the table may not list all data elements used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, you may not be able to see certain data elements. But they are there and used by Factiva to organize and structure its proprietary news article data.

How to Extract Data Elements in Factiva Article

[Figure: flow diagram of the three-step extraction process]

You can follow the three steps outlined in the diagram above to extract the data elements of news articles for further processing (e.g., calculating the tone of the full text, represented by the LP and TD elements together, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one below.

Step 1: Download Articles from Factiva in RTF Format

Downloading a large number of news articles from Factiva is a lot of pain: it is technically difficult to download articles in an automated fashion, and you can only download 100 articles at a time, subject to a word-count limit of 180,000 words per batch. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.

Firstly, you may see people discuss methods for automatic downloading (a so-called “web scraping” technique; see here). However, this requires more hacking now that Factiva has introduced CAPTCHA to determine whether or not the user is a human. Even if you are not familiar with the term “CAPTCHA”, you have surely encountered the situation where you are asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to prohibit robotic downloading. CAPTCHA is not unbeatable, but defeating it requires advanced techniques.

Secondly, the Factiva licence expressly prohibits data mining, yet it does not define clearly what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be red-flagged as data mining. But the threshold speed set by Factiva is low, and any trained and adept person can beat it easily. If you are red-flagged by Factiva, things could get ugly. So do not download too fast, even if this slows down your research.

Let’s get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option. Please select the third one, Full Article/Report plus Indexing, as indicated by the following screenshot:

[Screenshot: Factiva display options, with “Full Article/Report plus Indexing” selected]

Then you have to download the articles in RTF – Article Format, as indicated by the following screenshot:

[Screenshot: Factiva download options, with “RTF – Article Format” selected]

After the download is completed, you will get an RTF document. If you open it, you will see that the news articles look like this:

[Screenshot: news articles as they appear in the downloaded RTF document]

The next step is to convert RTF to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row represents a news article, and each column is a data element.

Step 2: Convert RTF to TXT

This can surely be done in Python, but so far I have not written a program for it; I will fill this “hole” when I have time. For my research, I simply take advantage of TextEdit, the default text editor shipped with macOS: select Format – Make Plain Text from the menu bar, then save the document in TXT format. You can automate this using Automator in macOS.
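
If you do want to do the conversion in Python, one possible route is the third-party striprtf package (an assumption on my part, not something used in the original post):

```python
from pathlib import Path

from striprtf.striprtf import rtf_to_text  # pip install striprtf

# Convert every RTF document in the current folder to a plain TXT document.
for rtf_file in Path(".").glob("*.rtf"):
    plain = rtf_to_text(rtf_file.read_text(errors="ignore"))
    rtf_file.with_suffix(".txt").write_text(plain)
```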

Step 3: Extract Data Elements and Save to a Table

This is where Python does the dirty work. To run the program correctly, save it in the same directory as the plain TXT documents created in Step 2 before you run it. The program will:

  1. Read in each TXT document;
  2. Extract data elements of each article and write them to an SQLite database;
  3. Export data to a CSV file for easy processing in other software such as Stata.

I introduce an intermediate step that writes the data to an SQLite database simply because this facilitates further manipulation of the news article data in Python. Of course, you can write the data directly to a CSV file.
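
As a rough illustration of what such a program can look like (not the exact program), the sketch below assumes that in the converted TXT files each field starts on a new line with its label followed by its content, that AN closes each article, and it writes straight to a CSV file for brevity:

```python
import csv
import re
from pathlib import Path

FIELDS = ["HD", "CR", "WC", "PD", "ET", "SN", "SC", "ED", "PG", "LA", "CY",
          "LP", "TD", "CT", "RF", "CO", "IN", "NS", "RE", "IPC", "IPD",
          "PUB", "AN"]
# a line that begins with a field label, e.g. "HD Some headline ..."
label_re = re.compile(r"^(" + "|".join(FIELDS) + r")\b\s*(.*)$")

def parse_file(path):
    articles, current, field = [], {}, None
    for line in path.read_text(errors="ignore").splitlines():
        m = label_re.match(line)
        if m:
            field = m.group(1)
            current[field] = m.group(2).strip()
            if field == "AN":              # AN is assumed to close an article
                articles.append(current)
                current, field = {}, None
        elif field:                        # continuation of the current field
            current[field] += " " + line.strip()
    return articles

rows = [a for p in Path(".").glob("*.txt") for a in parse_file(p)]
with open("factiva.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```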

Posted in Python | 15 Comments

A loop of cross-sectional regressions for calculating abnormal accruals in Stata

I write a loop of cross-sectional regressions for calculating abnormal accruals. This program can easily be modified to use the Jones, modified Jones, or Dechow and Dichev model.

I add detailed comments in the program to help you prepare the input file.
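
For readers who work in Python rather than Stata, the same idea can be sketched roughly as follows; the DataFrame and column names (sic2, fyear, tacc, inv_lag_at, d_rev, ppe, all scaled by lagged total assets) are hypothetical, and this is not the Stata program itself:

```python
import pandas as pd
import statsmodels.api as sm

def jones_residuals(group: pd.DataFrame) -> pd.Series:
    # one cross-sectional Jones-model regression; the residual is the
    # abnormal accrual for each firm-year in the industry-year group
    X = sm.add_constant(group[["inv_lag_at", "d_rev", "ppe"]])
    return sm.OLS(group["tacc"], X, missing="drop").fit().resid

def abnormal_accruals(df: pd.DataFrame) -> pd.Series:
    # loop over 2-digit SIC industry-years, mirroring the Stata loop
    return df.groupby(["sic2", "fyear"], group_keys=False).apply(jones_residuals)
```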

Posted in Stata | 7 Comments

The impact of WRDS transition to the new WRDS Cloud server

WRDS has quietly started the transition from the old server to the new Cloud server. This move makes a lot of support documentation on the WRDS website outdated and misleading. That is why I think WRDS should direct its resources toward continuously updating tutorials and manuals and providing more ready-to-use research macros and applications, instead of wasting money on website cosmetics as it did recently.

Now, among the support documentation about accessing WRDS, only the following two documents are up to date:

The WRDS Cloud Manual
PC-SAS on the WRDS Cloud

All other documentation contains outdated information and may cause confusion and unexpected problems.

In its support documentation, WRDS refers to the old server as either WRDS Unix Server or WRDS Interactive Server (wrds3). The new server is called WRDS Cloud.

The address of the old server: wrds.wharton.upenn.edu 4016
The address of the new server: wrds-cloud.wharton.upenn.edu 4016

They are DIFFERENT! Users who access WRDS using SSH or PC-SAS will be affected by this transition.

PC-SAS users are familiar with the following statements:

PC-SAS users were able to use one of the eight SASTEMP directories on the server to store sizeable data files temporarily, and to upload/download data files to/from their home directory (/home/yourinstitution/youraccountname, with a 750 MB space limit). In addition, if you used SSH to log onto the old server, you would see the same home directory as with PC-SAS. As a result, if you uploaded a data file to your home directory via an easy-to-use SSH file transfer (FTP-like) app, you could locate the file in your home directory during PC-SAS connections.

Now this has changed. PC-SAS now (since August 25, 2015) connects through the WRDS Cloud instead of the older Interactive Server (wrds3), EVEN IF YOU STILL SPECIFY %let wrds = wrds.wharton.upenn.edu 4016;. The consequences of this change are:

  • You can no longer use the eight SASTEMP directories through PC-SAS. Instead, you can use a larger directory for your temporary data (500 GB shared by your institution), located at /scratch/yourinstitution. You can still access the eight SASTEMP directories if you log onto the old server via SSH.
  • The WRDS Cloud gives you a new home directory, though its path remains /home/yourinstitution/youraccount (with a new 10 GB space limit). So if you use SSH to log onto the old server (as many users probably do if they are unaware of the server transition), you cannot see the files that you create in your home directory during PC-SAS connections.

These two consequences may cause confusion for users who access WRDS via both PC-SAS and SSH interchangeably. They may ask: “why can’t I use the temporary directory any more?” or “where are my files?”

To avoid any possible problem, users should use the new WRDS Cloud server consistently with either SSH or PC-SAS from now on. This means whenever you access WRDS, always use the new server address.

If you use PC-SAS, use the following statements:

If you use SSH, use the following command:
ssh youraccountname@wrds-cloud.wharton.upenn.edu

With the new WRDS Cloud server, you use a new command to run your SAS program in the background from the SSH command line:
qsas yourprogram.sas

You can run multiple SAS programs concurrently this way (up to 5 concurrent jobs). If you prefer to run your SAS programs sequentially, you need to write a SAS wrapper script and submit it as a batch job. You can find details here.

You can use qstat to browse your currently running jobs and get the job ID. If you change your mind and want to terminate a job, you can type:
qdel yourjobid

WRDS is going to phase out the old server. The new WRDS Cloud is supposed to be more computationally powerful. Plus, the new WRDS server offers users a larger home directory and temporary directory. Therefore, it is time for users to migrate to the new WRDS Cloud server.

Posted in Learning Resources, SAS | 2 Comments

Rolling-window computation in SAS and Stata

SASers often find proc expand plus transformout very useful for rolling-window (or moving-window) computation. Stataers may wonder if there is a counterpart in Stata. The answer is “yes”: the command in Stata is rolling. See the manual below:

http://www.stata.com/manuals13/tsrolling.pdf

The benefits of using rolling in Stata come from two facts:

  • Stata is superior to SAS in dealing with time-series or panel data. After a single-line command to define time-series or panel data (tsset), Stata handles gaps in the time series intelligently and automatically. In contrast, SAS users have to check for gaps manually; 90% of the SAS code using rolling-window transformations in accounting research does not include such a gap check, which may lead to incorrect inferences.
  • In Stata, rolling can be combined with any other command, such as regress, so rolling-window computation in Stata is more flexible.

However, proc expand plus transformout in SAS is insanely faster than rolling in Stata (by “insanely faster”, I mean perhaps millions of times faster). This is truly a deal breaker for Stata.

Therefore, the best solution for rolling-window computation is to use Stata to do the gap check and filling (tsfill) first, and then use SAS to do the lightning-fast rolling-window computation.

Posted in Learning Resources, SAS, Stata | Leave a comment

SAS macro for event study and beta

There are two macros on the List of WRDS Research Macros, EVTSTUDY and BETA, that are probably used often.

I like the first one, written by Denys Glushkov; Denys’ code is always elegant. I don’t like the second one because I believe it contains non-trivial mistakes and does a lot of unnecessary calculation.

Since an event study and a beta calculation are just two sides of the same coin, I wrote the following macro to output both event study results (e.g., CAR) and beta. My macro borrows heavily from Denys’ code but differs in the following ways:

  1. I add beta to the final output. This is the main difference.
  2. Denys uses CRSP.DSIY to generate the trading calendar and market returns. I cannot see why he uses this dataset; the trouble is that not every institution subscribes to it. Thus, I use the more accessible dataset CRSP.DSI instead (thanks to Michael Shen for bringing this to my attention).
  3. I improve efficiency in generating related trading dates at the security-event level.
  4. I correct several errors in Denys’ macro: (a) his macro does not sort the input dataset by permno and event date, leading to a fatal error later on; and (b) I correct a few dataset/variable references.
  5. Denys’ macro switches off warning and error messages, which is inconvenient for debugging. I change this setting.

All changes are commented with /* CHANGE HERE */. I compared the results (CAR and beta) from my macro with those from a commercial package, EVENTUS (with the help of a friend who has an EVENTUS license), and the accuracy of my macro is assured (note: EVENTUS does not apply delisting returns by default).

Update: WRDS rolled out the event study web inquiry (so-called Event Study by WRDS). I recently checked the accuracy of that product. To my surprise, the accuracy is unsatisfactory, if not terrible.

 

Posted in SAS | 1 Comment

Use Python to calculate the tone of financial articles

[Update on 2019-03-01] I have completely rewritten the Python program. The updates include:

  • I include two domain-specific dictionaries, Loughran and McDonald’s and Henry’s, and you can choose which one to use.
  • I add a negation check as suggested by Loughran and McDonald (2011). That is, any negate word (e.g., isn’t, not, never) occurring within the three words preceding a positive word will flip that positive word into a negative one (see the sketch below). The negation check applies only to positive words because Loughran and McDonald (2011) suggest that double negation (i.e., a negate word preceding a negative word) is not common. I expand their negate word list, though, since theirs seems incomplete. In my sample of 90,000+ press releases, the negation check finds that 5.7% of press releases have positive word(s) preceded by a negate word.
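
For intuition, here is a minimal sketch of the negation check; the word lists are tiny illustrative stand-ins, not the actual Loughran-McDonald or negate word lists:

```python
import re

POSITIVE = {"gain", "improve", "strong"}          # illustrative only
NEGATIVE = {"loss", "decline", "weak"}            # illustrative only
NEGATE = {"no", "not", "never", "isn't", "wasn't", "cannot"}

def count_tone(article: str):
    words = re.findall(r"[a-z']+", article.lower())   # alphabetic tokens only
    pos = neg = 0
    for i, w in enumerate(words):
        if w in POSITIVE:
            # flip the positive word if a negate word appears within the
            # three preceding words (Loughran and McDonald 2011)
            if any(x in NEGATE for x in words[max(0, i - 3):i]):
                neg += 1
            else:
                pos += 1
        elif w in NEGATIVE:
            neg += 1
    return len(words), pos, neg

print(count_tone("Sales did not improve, and the decline was strong."))  # (9, 1, 2)
```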

Please note:

  • The Python program first transforms an article into a bag of words in their original order. Different research questions may define “word” differently; for example, some only consider alphabetic words (i.e., they remove all numbers in an article). I use this definition in the following Python program, but you may want to change it to suit your research question. In addition, there are many nuances in splitting sentences into words; the splitting method in the following Python program is simple but, of course, imperfect.
  • To use the Python program, you have to know how to assign the full text of an article to the variable article (using a loop) and how to output the results into a database-like file (SQLite or CSV).

I acknowledge the work done by C.J. Hutto (see his work at GitHub).

[Original Post] I found two internet resources for this task (thanks to both authors):

The first solution is far more efficient than the second, but the second is more straightforward. The first also requires knowledge of PostgreSQL and R besides Python. I borrow from both resources and write the Python code below.

Please note, to use the Python code, you have to know how to assign the full text of an article of interest to the variable text, and how to output the total word count and the counts of positive/negative words in text.

In the first part of the code, I read the dictionary (the word list) into a Python dictionary variable. The word list used here is expected to be a .txt file in the following format:

For accounting and finance research, a commonly used positive/negative word list was developed by Bill McDonald. See his website.

In the second part of the code, I create regular expressions that find occurrences of positive/negative words. The last few lines of code count the positive/negative words in the text.
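
A minimal sketch of this two-part approach is shown below; the word-list format (one “word,category” pair per line) and the inline sample are assumptions for illustration, so replace them with the actual word-list file:

```python
import io
import re

# inline stand-in for the word-list .txt file
wordlist = io.StringIO("strong,positive\ngrowth,positive\nweak,negative\nloss,negative\n")

# Part 1: read the word list into a Python dictionary
dictionary = {"positive": [], "negative": []}
for line in wordlist:
    word, category = line.strip().lower().split(",")
    dictionary[category].append(re.escape(word))

# Part 2: one regular expression per category, matching whole words only
patterns = {cat: re.compile(r"\b(?:" + "|".join(words) + r")\b")
            for cat, words in dictionary.items()}

text = "Revenue growth was strong, but the outlook remains weak.".lower()
total_words = len(re.findall(r"\b[a-z']+\b", text))
pos_count = len(patterns["positive"].findall(text))
neg_count = len(patterns["negative"].findall(text))
print(total_words, pos_count, neg_count)   # 9 2 1
```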

Posted in Python | 14 Comments

How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ)?

The annual data (FUNDA) are easy to deal with; we just need to apply the following conditions:

indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"

If we have converted FUNDA to Stata format, the uniqueness of GVKEY-DATADATE can be verified using the following Stata command:

duplicates report gvkey datadate if indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"

This command will return “no duplicates”.

The quarterly data (FUNDQ) are a little more complicated. First of all, applying the same conditions won’t work: in fact, 99.7% of observations in FUNDQ already satisfy these conditions, yet duplicate GVKEY-DATADATE pairs still exist. The root cause of the duplicates is a firm changing its fiscal year end. I use the following example as an illustration:

Variable definitions: FYEARQ – fiscal year; FQTR – fiscal quarter; FYR – fiscal year-end month; DATACQTR – calendar quarter; DATAFQTR – fiscal quarter; ATQ – total assets; NIQ – quarterly net income; NIY – year-to-date net income.

In this example, duplicates exist for three DATADATEs: 2010-03-31, 2010-06-30, and 2010-09-30. The data suggest that on March 31, 2010, the firm changed its fiscal year end from March 31 to December 31 (i.e., FYR changed from 3 to 12). As a result, 2010-03-31 appears twice in FUNDQ: one record is fiscal 2009Q4 (based on the old fiscal year end) and the other is 2010Q1 (based on the new fiscal year end). FUNDQ also reports additional duplicates for the subsequent two quarters (I don’t know why). Next, if we compare NIQ and NIY as highlighted in the red rectangle, the observation for fiscal 2009Q4 shows NIY > NIQ, which makes sense because NIY is the four-quarter sum while NIQ is single-quarter net income. In contrast, the observation for fiscal 2010Q1 shows NIQ = NIY, because both are single-quarter net income in this case.

So, what’s the best strategy to remove duplicate GVKEY-DATADATE pairs?

Before we answer this question, let’s take a closer look at the duplicate GVKEY-DATADATE pairs in FUNDQ, which reveals that 99.8% of GVKEY-DATADATE pairs in FUNDQ are unique as of December 5, 2017. This suggests that no matter how we deal with the duplicates, even if we simply delete all of them, our results probably won’t change in a noticeable way.

That said, if we want to remove duplicates in a more careful way, COMPUSTAT gives the following clue:

In the definition of DATAFQTR, COMPUSTAT notes that,

Note: Companies that undergo a fiscal-year change may have multiple records with the same datadate. Compustat delivers those multiple records with the same datadate but each record relates to a different fiscal year-end period.

Rule: Select records from the co_idesind data group where datafqtr is not null, to view as fiscal data.

Unfortunately, I find that the suggested rule is not the best strategy, because COMPUSTAT seems to set DATAFQTR to missing or non-missing in an inconsistent way. In my opinion, the best strategy is to retain the GVKEY-DATADATE record that reflects the most recent change of fiscal year end. This means, in the above example, we should delete the following observations:

    • DATADATE = 2010-03-31 and FYR = 3
    • DATADATE = 2010-06-30 and FYR = 3
    • DATADATE = 2010-09-30 and FYR = 3

Suppose we have converted the FUNDQ raw file to Stata format; the following Stata code implements the above strategy and saves the results in FUNDQ_NODUP. By the way, this is the cleverest Stata code I’ve ever written 🙂 The code also creates a new variable FYQ (or CYQ) that represents the fiscal (or calendar) year and quarter. It removes not only duplicate GVKEY-DATADATE pairs but also duplicate GVKEY-FYQ (or GVKEY-CYQ) pairs, which then allows us to use the tsset command and do lag and change calculations in Stata, e.g., to get beginning-of-quarter total assets or to calculate seasonal changes in sales. Stata really shines in lag and change calculations for panel data, a superb advantage over SAS.
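
For readers who work in Python, the same strategy can be sketched in pandas as follows (this is not the Stata code, and it assumes each firm changes its fiscal year end at most once within the sample window); the toy data mirror the example above:

```python
import pandas as pd

fundq = pd.DataFrame({
    "gvkey":    ["001234"] * 8,
    "datadate": pd.to_datetime(["2009-12-31", "2010-03-31", "2010-03-31",
                                "2010-06-30", "2010-06-30", "2010-09-30",
                                "2010-09-30", "2010-12-31"]),
    "fyr":      [3, 3, 12, 3, 12, 3, 12, 12],
}).sort_values(["gvkey", "datadate"])

# The FYR reported at each firm's latest DATADATE is the post-change year end.
latest_fyr = fundq.groupby("gvkey")["fyr"].transform("last")

# Keep non-duplicated records plus, among duplicates, the record whose FYR
# reflects the most recent change of fiscal year end.
dup = fundq.duplicated(["gvkey", "datadate"], keep=False)
fundq_nodup = fundq[~dup | (fundq["fyr"] == latest_fyr)]
print(fundq_nodup)
```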

Please note: I also agree with one reader’s comment that “(how to remove duplicates) depends on what you need”. For example, in one of my projects, I want to look at the three-day CAR around the earnings announcement date (RDQ) and use total assets as the deflator in my regression. As a result, when duplicate GVKEY-DATADATE pairs occur, the one with non-missing RDQ and ATQ is preferred if I want to retain as many observations as possible. The following Stata commands serve the purpose well:
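
In pandas, that preference could be expressed roughly as follows; the column names and the tie-breaking rule are assumptions:

```python
import pandas as pd

def dedup_prefer_nonmissing(fundq: pd.DataFrame) -> pd.DataFrame:
    # Among duplicated gvkey-datadate pairs, keep the record with the fewest
    # missing values in rdq and atq; ties keep the first record after sorting.
    n_missing = fundq[["rdq", "atq"]].isna().sum(axis=1)
    return (fundq.assign(_nmiss=n_missing)
                 .sort_values(["gvkey", "datadate", "_nmiss"])
                 .drop_duplicates(["gvkey", "datadate"], keep="first")
                 .drop(columns="_nmiss"))
```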

 

Posted in Data, Stata | 18 Comments

If beginning year and ending year are known, how to fill in years in between?

Question:

Suppose two companies, such as A and B, are connected for a range of years. Right now the data structure is the following:

Company 1 | Company 2 | Starting Year | Ending Year
A | B | 2000 | 2006
A | C | 1998 | 2003
C | D | 1995 | 1997

I want to find a way to generate:

Company 1 | Company 2 | Year
A | B | 2000
A | B | 2001
… | … | …
A | B | 2006
A | C | 1998
… | … | …
A | C | 2003
C | D | 1995
… | … | …
C | D | 1997

Answer:
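
A pandas sketch of one way to do this (the original answer is written in SAS):

```python
import pandas as pd

df = pd.DataFrame({
    "company1":   ["A", "A", "C"],
    "company2":   ["B", "C", "D"],
    "start_year": [2000, 1998, 1995],
    "end_year":   [2006, 2003, 1997],
})

# build the list of years covered by each row, then expand to one row per year
df["year"] = [list(range(s, e + 1)) for s, e in zip(df["start_year"], df["end_year"])]
out = (df.explode("year")
         .drop(columns=["start_year", "end_year"])
         .reset_index(drop=True))
print(out)
```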

 

Posted in SAS | 3 Comments

SAS macro to get analysts EPS consensus for a given fiscal period end (DATADATE) by a selected date (DATE)

Update: this macro contains an error! I will update it later.

I wrote this macro to compute analysts’ quarterly EPS consensus on a selected date (DATE) for a given fiscal quarter end (DATADATE). The macro can easily be modified for other types of estimates (e.g., annual EPS).

This macro currently extracts unadjusted quarterly EPS estimates (current or next quarter) issued within a specified window (e.g., WINDOW = 60 days) before the selected date (DATE), and computes the consensus as both the median and the mean. The macro also extracts the unadjusted actual EPS and puts the estimates and the actual on the same per-share basis. Please see Overview of Thomson-Reuters IBES for why this is the preferred method.

The output includes the mean and median of analysts’ EPS forecasts, the number of analyst forecasts, and the actual EPS. Please note the actual EPS is the “street” EPS from I/B/E/S and may differ from the reported EPS in Compustat (see the I/B/E/S FAQ). With these outputs, you can compute earnings surprises easily.
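
For intuition, the consensus logic can be sketched in pandas roughly as follows (this is not the SAS macro); the column names follow the I/B/E/S unadjusted detail file, and date and window correspond to the macro's DATE and WINDOW parameters:

```python
import pandas as pd

def consensus(det: pd.DataFrame, ibes_ticker, fpedats, date, window=60):
    # unadjusted detail estimates for the fiscal period end, issued within
    # `window` days before the selected date
    d = det[(det["ticker"] == ibes_ticker)
            & (det["fpedats"] == fpedats)
            & (det["anndats"] <= date)
            & (det["anndats"] > date - pd.Timedelta(days=window))]
    # keep each analyst's most recent estimate before the selected date
    d = d.sort_values("anndats").groupby("analys", as_index=False).last()
    return pd.Series({"numest": len(d),
                      "meanest": d["value"].mean(),
                      "medest": d["value"].median()})
```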

This macro is modified from the Post-Earnings Announcement Drift program on WRDS. I thank the author, Denys Glushkov, who is my favorite SAS programmer.

I looked into the I/B/E/S database more closely and noted a few facts:

  • Since 1998, only 1.3% of analysts’ forecasts have been reported on the primary basis (vs. 74.4% before 1998), so the variable basis is no longer important;
  • During the period from the issuance date of an individual quarterly EPS forecast to the announcement date of the actual EPS, a stock split or reverse stock split occurs in only 0.6% of observations. Thus, putting the estimate and the actual on the same per-share basis matters little if you specify a short window, e.g., 60 days;
  • anndats in the I/B/E/S Actuals Detail file refers to the earnings announcement date. It is supposed to be the same as rdq in Compustat Quarterly; however, this is not true for 20% of observations. WRDS suggests that rdq is more accurate (see here).

 

Posted in SAS | Comments Off on SAS macro to get analysts EPS consensus for a given fiscal period end (DATADATE) by a selected date (DATE)