[Update on 2019-07-31] This post, together with its sibling post “Part II”, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and the text-format filings are extremely unfriendly for researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup. The other post deserves more attention.
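For readers moving to the HTML-format filings, here is a minimal sketch of the kind of parsing I have in mind with BeautifulSoup. The HTML fragment below is a made-up stand-in for a real downloaded filing page, not actual EDGAR content:

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for a downloaded HTML-format filing
html = """<html><body>
<p>Item 1A. Risk Factors</p>
<table><tr><td>Revenue</td><td>1,000</td></tr></table>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# Strip the tags and keep the readable text, one element per line
text = soup.get_text(separator='\n', strip=True)
print(text)
```

The same `get_text` call works on a full filing fetched with `requests`, which is why HTML-format filings are far friendlier than the old raw text ones.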
[Update on 2018-10-06] As I acknowledged in the very first edition of this post, I borrowed some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Edouard kindly informed me that he had updated his module (see his GitHub page). The major updates include: (1) he migrated the file download from FTP to HTTPS, and (2) he added parallel downloads, so it is now faster to rebuild the full index, especially when going all the way back to 1993. My initial impression is that his updated module provides more flexibility and should be more robust than mine. Thank you, Edouard, for your work!
[Update on 2017-03-03] SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol—HTTPS. So the description of the FTP server in the original post no longer applies (but the basic idea behind the URLs to raw text filings remains unchanged). Since then I have received several requests to update the script. Here is the new script for Part I.
```python
# Generate the list of index files archived in EDGAR since start_year
# (earliest: 1993) until the most recent quarter
import datetime

# Please download index files chunk by chunk. For example, first download index
# files during 1993-2000, then download index files during 2001-2005 by changing
# the following two lines repeatedly, and so on. If you need index files up to
# the most recent year and quarter, comment out the following three lines,
# uncomment the next three lines, and define start_year as the year that
# immediately follows the ending year of the previous chunk.
start_year = 2011     # change start_year and end_year to re-define the chunk
current_year = 2015   # change start_year and end_year to re-define the chunk
current_quarter = 4   # do not change this line
# start_year = 2016   # only change this line to download the most recent chunk
# current_year = datetime.date.today().year
# current_quarter = (datetime.date.today().month - 1) // 3 + 1

years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/master.idx'
        % (x[0], x[1]) for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, '
            'path TEXT)')

for url in urls:
    lines = requests.get(url).content.decode('utf-8', 'ignore').splitlines()
    records = [tuple(line.split('|')) for line in lines[11:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
data.to_stata('edgar_idx.dta', version=117)
```
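Once the index database is built, selecting filings reduces to SQL queries against the `idx` table. A small sketch of the idea — the function name is my own, and the `https://www.sec.gov/Archives/` prefix reflects the same URL pattern used for the index files above:

```python
import sqlite3

def filing_urls(con, form_type, year):
    """Return (cik, company, date, url) rows for one form type and year.

    Assumes the idx table created by the script above; prepending the
    Archives root to a path field yields a directly downloadable URL.
    """
    cur = con.execute(
        'SELECT cik, conm, date, path FROM idx WHERE type = ? AND date LIKE ?',
        (form_type, '%d-%%' % year))
    return [(cik, conm, date, 'https://www.sec.gov/Archives/' + path)
            for cik, conm, date, path in cur]
```

For example, `filing_urls(con, '10-K', 2015)` yields direct download URLs for every 10-K indexed in 2015.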
The technical details may be too boring for most people, so I provide several downloadable Stata datasets that include all index files from 1993 Q1 to October 6, 2018.
Stata format (1993–2000); Stata format (2001–2005); Stata format (2006–2010); Stata format (2011–2015); Stata format (2016–2019/03/16)
[Original Post] We know that SEC makes company filings (e.g., 10-Ks, 10-Qs, and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. SEC provides an anonymous EDGAR FTP server for accessing raw text filings. Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or an FTP client to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:
ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx
If we copy the path into an Internet browser or an FTP client, we can download the file directly.
In the above example, we can find the path to “master.idx” by navigating the EDGAR FTP server. But we cannot find the path to any raw text filing this way: paths to raw text filings are not visible by simply browsing the EDGAR FTP server. SEC purposely hides these paths to reduce server load and avoid data abuse.
In order to download SEC filings on EDGAR, we have to:
- Find paths to raw text filings;
- Select what we want and bulk download raw text filings from the EDGAR FTP server using paths we have obtained in the first step.
This post describes the first step; I elaborate on the second step in another post.
SEC stores all path information in index files (see technical details here). Here is a snapshot of an index file:
```
Description:           Master Index of EDGAR Dissemination Feed
Last Data Received:    December 31, 2015
Comments:              webmaster@sec.gov
Anonymous FTP:         ftp://ftp.sec.gov/edgar/

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000028|AIM SAFETY CO INC|REVOKED|2015-12-21|edgar/data/1000028/9999999997-15-015914.txt
1000032|BINCH JAMES G|4|2015-12-02|edgar/data/1000032/0001209191-15-082911.txt
1000032|BINCH JAMES G|4|2015-12-11|edgar/data/1000032/0001209191-15-084690.txt
1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|edgar/data/1000045/0001193125-15-371499.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2015-10-29|edgar/data/1000045/0001193125-15-357258.txt
1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2015-11-03|edgar/data/1000045/0001436857-15-000017.txt
1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|13F-HR|2015-11-16|edgar/data/1000097/0000919574-15-008246.txt
1000097|KINGDON CAPITAL MANAGEMENT, L.L.C.|SC 13G|2015-12-23|edgar/data/1000097/0000919574-15-008800.txt
1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar/data/1000177/0000919574-15-007734.txt
1000177|NORDIC AMERICAN TANKERS Ltd|6-K|2015-11-12|edgar/data/1000177/0000919574-15-007742.txt
1000177|NORDIC AMERICAN TANKERS Ltd|SC 13D/A|2015-11-17|edgar/data/1000177/0000919574-15-008279.txt
... ...
```
The last field of each line in the main body of the index file is the path to an actual raw text filing. So the first step is to download and parse all index files and write their content into a database. Then, in the second step, we can run any query against the database (e.g., select a certain form type or a certain period of time) and download the raw text filings using the selected paths.
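To make the parsing concrete, here is how one body line of the index file splits into the five fields and turns into a download URL. The example line is taken from the snapshot above, and the Archives prefix is how EDGAR exposes these paths over HTTPS:

```python
# One body line from master.idx, pipe-delimited
line = ('1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|'
        'edgar/data/1000045/0001193125-15-371499.txt')

# The five fields map directly onto the columns of our database table
cik, conm, form_type, date, path = line.split('|')

# Prepending the Archives root gives a directly downloadable URL
url = 'https://www.sec.gov/Archives/' + path
```

This is exactly what the `line.split('|')` in the script does for every line of every index file.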
I wrote the following Python program to execute the first step. This program borrows some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Please see his package information page here.
Please note: my program stores all paths in an SQLite database. I personally like this lightweight database product very much. The last few lines of the program transfer the data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, pandas and sqlalchemy, which you have to install with the pip command on your own. Please google the documentation of SQLite, Pandas, and SQLAlchemy if you run into installation problems. I use Python 3.x in all my Python posts.
```python
import datetime
import ftplib
import sqlite3
import tempfile
import zipfile

# Generate the list of quarterly zip files archived in EDGAR since
# start_year (earliest: 1993) until the most recent quarter
current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter):
    history.append((current_year, 'QTR%d' % i))
quarterly_files = ['edgar/full-index/%d/%s/master.zip' % (x[0], x[1])
                   for x in history]
quarterly_files.sort()

# Generate the list of daily index files archived in EDGAR for the most
# recent quarter
ftp = ftplib.FTP('ftp.sec.gov')
ftp.login()
daily_files = [f for f in ftp.nlst('edgar/daily-index')
               if f.startswith('edgar/daily-index/master')]
daily_files.sort()

# Download index files and write content into SQLite
con = sqlite3.connect('edgar_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, '
            'path TEXT)')

for file in quarterly_files:
    with tempfile.TemporaryFile() as temp:
        ftp.retrbinary('RETR %s' % file, temp.write)
        with zipfile.ZipFile(temp).open('master.idx') as z:
            for i in range(10):
                z.readline()
            records = [tuple(line.decode('latin-1').rstrip().split('|'))
                       for line in z]
            cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
            print(file, 'downloaded and wrote to SQLite')

for file in daily_files:
    with tempfile.TemporaryFile() as temp:
        ftp.retrbinary('RETR %s' % file, temp.write)
        temp.seek(0)
        for i in range(7):
            temp.readline()
        records = [tuple(line.decode('latin-1').rstrip().split('|'))
                   for line in temp]
        cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
        print(file, 'downloaded and wrote to SQLite')

con.commit()
con.close()
ftp.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
data.to_stata('edgar_idx.dta')
```
I found two articles that explain how to achieve the same functionality with R and Perl. I include the links (R or Perl) for users who are more comfortable with those languages.