[Update on 2019-07-31] This post, together with its sibling post “Part II”, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly for researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other post deserves more attention.
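For readers moving to the HTML-format filings, here is a minimal sketch of the kind of parsing I have in mind with BeautifulSoup. The HTML fragment below is a made-up stand-in for a real downloaded filing page, not actual EDGAR content:

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for a downloaded HTML-format filing
html = """<html><body>
<p>Item 1A. Risk Factors</p>
<table><tr><td>Revenue</td><td>1,000</td></tr></table>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# Strip the tags and keep the readable text, one element per line
text = soup.get_text(separator='\n', strip=True)
print(text)
```

The same `get_text` call works on a full filing fetched with `requests`, which is why HTML-format filings are far friendlier than the old raw text ones.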
[Update on 2018-10-06] As I acknowledged in the very first edition of this post, I borrowed some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Edouard kindly informed me that he had updated his module (see his GitHub page). The major updates include: (1) he migrated the file download from FTP to HTTPS, and (2) he added parallel downloads, so it is now faster to rebuild the full index, especially when going all the way back to 1993. My initial impression is that his updated module provides more flexibility and should be more robust than mine. Thank you, Edouard, for your work!
[Update on 2017-03-03] SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol—HTTPS. So the description of the FTP server in the original post no longer applies (but the basic idea behind the URLs to raw text filings remains unchanged). Since then I have received several requests to update the script. Here is the new script for Part I.
```python
# Generate the list of index files archived in EDGAR since start_year
# (earliest: 1993) until the most recent quarter
import datetime

# Please download index files chunk by chunk. For example, first download index
# files during 1993-2000, then download index files during 2001-2005 by changing
# the following two lines repeatedly, and so on. If you need index files up to
# the most recent year and quarter, comment out the following three lines,
# uncomment the next three lines, and define start_year as the year that
# immediately follows the ending year of the previous chunk.
start_year = 2011     # change start_year and end_year to re-define the chunk
current_year = 2015   # change start_year and end_year to re-define the chunk
current_quarter = 4   # do not change this line
# start_year = 2016   # only change this line to download the most recent chunk
# current_year = datetime.date.today().year
# current_quarter = (datetime.date.today().month - 1) // 3 + 1

years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/master.idx'
        % (x[0], x[1]) for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, '
            'path TEXT)')

for url in urls:
    lines = requests.get(url).content.decode('utf-8', 'ignore').splitlines()
    records = [tuple(line.split('|')) for line in lines[11:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
data.to_stata('edgar_idx.dta', version=117)
```
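Once the index database is built, selecting filings reduces to SQL queries against the `idx` table. A small sketch of the idea — the function name is my own, and the `https://www.sec.gov/Archives/` prefix reflects the same URL pattern used for the index files above:

```python
import sqlite3

def filing_urls(con, form_type, year):
    """Return (cik, company, date, url) rows for one form type and year.

    Assumes the idx table created by the script above; prepending the
    Archives root to a path field yields a directly downloadable URL.
    """
    cur = con.execute(
        'SELECT cik, conm, date, path FROM idx WHERE type = ? AND date LIKE ?',
        (form_type, '%d-%%' % year))
    return [(cik, conm, date, 'https://www.sec.gov/Archives/' + path)
            for cik, conm, date, path in cur]
```

For example, `filing_urls(con, '10-K', 2015)` yields direct download URLs for every 10-K indexed in 2015.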
The technical details may be too boring for most people, so I provide several downloadable Stata datasets that include all index files from 1993 Q1 to October 6, 2018.
Stata format (1993–2000); Stata format (2001–2005); Stata format (2006–2010); Stata format (2011–2015); Stata format (2016–2019/03/16)
[Original Post] We know that SEC makes company filings (e.g., 10-Ks, 10-Qs, and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. SEC provides an anonymous EDGAR FTP server for accessing raw text filings. Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or an FTP client to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:
ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx
If we copy the path into an Internet browser or an FTP client, we can download the file directly.
In the above example, we can find the path to “master.idx” by navigating the EDGAR FTP server. But we cannot find the path to any raw text filing this way: paths to raw text filings are not visible by simply browsing the EDGAR FTP server. SEC purposely hides these paths to reduce server load and avoid data abuse.
In order to download SEC filings on EDGAR, we have to:
- Find paths to raw text filings;
- Select what we want and bulk download raw text filings from the EDGAR FTP server using paths we have obtained in the first step.
This post describes the first step; I elaborate on the second step in another post.
SEC stores all path information in index files (see technical details here). Here is a snapshot of an index file:
```
Description:           Master Index of EDGAR Dissemination Feed
Last Data Received:    December 31, 2015
Comments:              webmaster@sec.gov
Anonymous FTP:         ftp://ftp.sec.gov/edgar/

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000028|AIM SAFETY CO INC|REVOKED|2015-12-21|edgar/data/1000028/9999999997-15-015914.txt
1000032|BINCH JAMES G|4|2015-12-02|edgar/data/1000032/0001209191-15-082911.txt
1000032|BINCH JAMES G|4|2015-12-11|edgar/data/1000032/0001209191-15-084690.txt
1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|edgar/data/1000045/0001193125-15-371499.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2015-10-29|edgar/data/1000045/0001193125-15-357258.txt
1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2015-11-03|edgar/data/1000045/0001436857-15-000017.txt
1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|13F-HR|2015-11-16|edgar/data/1000097/0000919574-15-008246.txt
1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|SC 13G|2015-12-23|edgar/data/1000097/0000919574-15-008800.txt
1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar/data/1000177/0000919574-15-007734.txt
1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar/data/1000177/0000919574-15-007742.txt
1000177|NORDIC AMERICAN TANKERS Ltd|SC 13D/A|2015-11-17|edgar/data/1000177/0000919574-15-008279.txt
... ...
```
The last field of each line in the main body of the index file is the path to an actual raw text filing. So the first step is to download and parse all index files and write their content into a database. Then, in the second step, we can run any query against the database (e.g., select a certain form type or a certain period of time) and download the raw text filings using the selected paths.
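To make the parsing concrete, here is how one body line of the index file splits into the five fields and turns into a download URL. The example line is taken from the snapshot above, and the Archives prefix is how EDGAR exposes these paths over HTTPS:

```python
# One body line from master.idx, pipe-delimited
line = ('1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|'
        'edgar/data/1000045/0001193125-15-371499.txt')

# The five fields map directly onto the columns of our database table
cik, conm, form_type, date, path = line.split('|')

# Prepending the Archives root gives a directly downloadable URL
url = 'https://www.sec.gov/Archives/' + path
```

This is exactly what the `line.split('|')` in the script does for every line of every index file.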
I wrote the following Python program to execute the first step. This program borrows some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Please see his package information page here.
Please note: my program stores all paths in an SQLite database. I personally like this lightweight database product very much. The last few lines of the program transfer the data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, pandas and sqlalchemy, which you have to install with the pip command on your own. Please google the documentation of SQLite, Pandas, and SQLAlchemy if you run into installation problems. I use Python 3.x in all my Python posts.
```python
import datetime
import ftplib
import sqlite3
import tempfile
import zipfile

# Generate the list of quarterly zip files archived in EDGAR since
# start_year (earliest: 1993) until the most recent quarter
current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter):
    history.append((current_year, 'QTR%d' % i))
quarterly_files = ['edgar/full-index/%d/%s/master.zip' % (x[0], x[1])
                   for x in history]
quarterly_files.sort()

# Generate the list of daily index files archived in EDGAR for the most
# recent quarter
ftp = ftplib.FTP('ftp.sec.gov')
ftp.login()
daily_files = [f for f in ftp.nlst('edgar/daily-index')
               if f.startswith('edgar/daily-index/master')]
daily_files.sort()

# Download index files and write content into SQLite
con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, '
            'path TEXT)')

for file in quarterly_files:
    with tempfile.TemporaryFile() as temp:
        ftp.retrbinary('RETR %s' % file, temp.write)
        with zipfile.ZipFile(temp).open('master.idx') as z:
            for i in range(10):
                z.readline()
            records = [tuple(line.decode('latin-1').rstrip().split('|'))
                       for line in z]
            cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
            print(file, 'downloaded and wrote to SQLite')

for file in daily_files:
    with tempfile.TemporaryFile() as temp:
        ftp.retrbinary('RETR %s' % file, temp.write)
        temp.seek(0)
        for i in range(7):
            temp.readline()
        records = [tuple(line.decode('latin-1').rstrip().split('|'))
                   for line in temp]
        cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
        print(file, 'downloaded and wrote to SQLite')

con.commit()
con.close()
ftp.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
data.to_stata('edgar_idx.dta')
```
I found two articles that explain how to achieve the same functionality with R and Perl. I include the links (R or Perl) for users who are more comfortable with those languages.