[Update on 2020-06-26] Eduardo has made a significant improvement to the code. Now you can specify a starting date and download the index files for the period from that starting date to the most recent date. I expect this to be very useful for many readers of my website. Eduardo has kindly shared the code in the comments. Thank you, Eduardo!
[Update on 2019-08-07] From time to time, some readers have reported that the first-part code seemingly stops at certain quarters. I don't know the exact reason (perhaps a server-side issue); I have never encountered the issue myself. I would suggest simply trying again later. I also share a Dropbox link from which you can download the first-part results (as of 2019-08-07; 2.4GB) in CSV format (link). Please note that, as I explained in my original post, the URL contained in the downloadable CSV is not the URL to the HTML-format filing; it is just the URL to an index page. You need to select your sample and go through the second-part code to get the URL to the HTML-format filing.
[Original post] I wrote two posts describing how to download TXT-format SEC filings on EDGAR:
Although TXT-format files are easy to process further, they are often poorly formatted and thus hard to read. An HTML-format 10-K is much easier on the eyes. Fortunately, SEC also provides the paths (i.e., URLs) to HTML-format filings. With the path, we can open an HTML-format filing in a web browser, or download the filing as a PDF.
The Python code consists of two parts. In the first part, we download the path data. Instead of the master.idx used in the above two posts, we use crawler.idx for this task. The path we get will be a URL like this:
https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm
Note that the path we get is a URL to an index page, not a URL to the HTML-format 10-Q itself (in this example). To get the direct URL to the HTML-format 10-Q, we have to go one level deeper. The second part of the Python code goes that level deeper and extracts the direct URL to the main body of the form (in more than 99% of cases, the URL embedded in the first row of the document table). The code also extracts information such as the filing date and period of report from the index page, and writes the output (including filing date, period of report, and direct URL) to log.csv. The following is an output example: the first URL is the path we get in the first part of the code; the second URL is the direct URL to the HTML-format form.
```
13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm,2017-10-11 23:44:42,2017-10-11 23:44:50,2014-05-05,2014-03-31,https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm,2017-10-11 23:44:51,2017-10-11 23:44:58,2016-11-03,2016-09-30,https://www.sec.gov/Archives/edgar/data/5272/000000527216000052/maindocument001.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm,2017-10-11 23:44:59,2017-10-11 23:45:05,2007-05-10,2007-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012307007211/y32085e10vq.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm,2017-10-11 23:45:07,2017-10-11 23:45:14,2006-05-10,2006-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012306006104/y19465e10vq.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm,2017-10-11 23:45:15,2017-10-11 23:45:20,2011-05-05,2011-03-31,https://www.sec.gov/Archives/edgar/data/5272/000104746911004647/a2203832z10-q.htm
```
The first part of the code:
```python
# Generate the list of index files archived in EDGAR since start_year
# (earliest: 1993) until the most recent quarter
import datetime

current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_htm_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')

# EDGAR now expects automated clients to identify themselves via User-Agent;
# replace the placeholder with your own name and email
headers = {'User-Agent': 'Your Name your.email@example.com'}
for url in urls:
    lines = requests.get(url, headers=headers).text.splitlines()
    # Line 8 (index 7) of crawler.idx is the column header; use the column
    # positions to slice each fixed-width record
    nameloc = lines[7].find('Company Name')
    typeloc = lines[7].find('Form Type')
    cikloc = lines[7].find('CIK')
    dateloc = lines[7].find('Date Filed')
    urlloc = lines[7].find('URL')
    records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(),
                      line[cikloc:dateloc].strip(), line[dateloc:urlloc].strip(),
                      line[urlloc:].strip()]) for line in lines[9:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_htm_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_stata('edgar_htm_idx.dta')
```
The first part of the code generates a dataset with the complete path information of SEC filings for the selected period (in both SQLite and Stata formats). You can then select a sample based on firm, form type, filing date, etc., and feed a CSV file to the second part of the code. The input CSV should look like this:
```
13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm
```
The second part of the code:
```python
import csv
import random
import time
from selenium import webdriver

with open('log.csv', 'w', newline='') as log:
    logwriter = csv.writer(log)
    with open('sample.csv', newline='') as infile:
        records = csv.reader(infile)
        for r in records:
            log_row = r.copy()
            print('Start fetching URL to', r[2], r[3], 'filed on', r[4], '...')
            start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
            # Path to the ChromeDriver executable; note that Selenium 4
            # passes the driver path via a Service object instead
            driver = webdriver.Chrome('./chromedriver')
            try:
                driver.get(r[5])
                # Random pause to avoid hammering the EDGAR server
                time.sleep(3 + random.random() * 3)
                # The find_element_by_* methods below were removed in recent
                # Selenium versions in favor of find_element(By.XPATH, ...)
                filing_date = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[1]/div[2]').text
                period_of_report = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[2]/div[2]').text
                form_text = driver.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a').text
                form_link = driver.find_element_by_link_text(form_text).get_attribute('href')
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Success!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, filing_date, period_of_report, form_link]
            except Exception:
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Error!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, 'ERROR!']
            driver.quit()
            logwriter.writerow(log_row)
```
Please note:
- Please use Python 3.x.
- Please install all required modules, such as Selenium.
- The second part of the code uses Selenium. There are other ways to do the job, e.g., using BeautifulSoup.
- The second part of the code only outputs the direct URL to the HTML-format filing; it does not download the filing itself.
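To illustrate the BeautifulSoup alternative, the sketch below extracts the direct URL from the document table of an index page. The HTML string is a trimmed, hand-written stand-in for a real EDGAR index page (the `tableFile` class matches the live markup, but this snippet is an illustration, not the full page); in practice you would fetch the page with requests instead.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for an EDGAR index page; real pages use the same
# "tableFile" table and relative document links, plus much more markup
html = """
<div id="formDiv">
  <table class="tableFile" summary="Document Format Files">
    <tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th></tr>
    <tr><td>1</td><td>10-Q</td>
        <td><a href="/Archives/edgar/data/5272/000000527214000007/maindocument001.htm">maindocument001.htm</a></td>
        <td>10-Q</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# The main body of the form sits in the first data row of the
# "Document Format Files" table (true in more than 99% of cases)
table = soup.find('table', class_='tableFile')
first_link = table.find_all('tr')[1].find('a')
form_link = 'https://www.sec.gov' + first_link['href']
print(form_link)
```

This avoids driving a browser, though a short pause between requests is still advisable to stay within EDGAR's rate limits.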
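If you also want the filing itself, one more request against the direct URL will fetch it. Below is a minimal sketch: the URL is the example from the output above, the output file name is arbitrary, and the User-Agent string is a placeholder you should replace with your own contact details, since EDGAR expects automated clients to identify themselves.

```python
import requests

# Direct URL produced by the second part of the code (example from above)
url = ('https://www.sec.gov/Archives/edgar/data/5272/'
       '000000527214000007/maindocument001.htm')

# EDGAR expects automated clients to identify themselves via User-Agent;
# replace the placeholder with your own name and email
headers = {'User-Agent': 'Your Name your.email@example.com'}
response = requests.get(url, headers=headers)
response.raise_for_status()

# Save the HTML locally; open it in a browser or print it to PDF from there
with open('maindocument001.htm', 'w', encoding='utf-8') as f:
    f.write(response.text)
```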