[Update on 2020-06-26] Eduardo has made a significant improvement to the code. Now you can specify a starting date and download the index files for the period from that starting date to the most recent date. I expect this to be very useful for many readers of my website. Eduardo has kindly shared the code in the comments. Thank you, Eduardo!
[Update on 2019-08-07] From time to time, readers have reported that the first-part code seemingly stalls at certain quarters. I don't know the exact reason (perhaps a server-side issue); I never encountered the issue myself. I would suggest that you just try again later. I also share a Dropbox link from which you can download the first-part results (as of 2019-08-07; 2.4GB) in CSV format (link). Please note that, as I explained in my original post, the URL contained in the downloadable CSV is not the URL to the HTML-format filing; it is just the URL to an index page. You need to select your sample and go through the second-part code to get the URL to the HTML-format filing.
[Original post] I wrote two posts to describe how to download TXT-format SEC filings on EDGAR:
- Use Python to download TXT-format SEC filings on EDGAR (Part I)
- Use Python to download TXT-format SEC filings on EDGAR (Part II)
Although TXT-format files are easy to process further, they are often not well formatted and thus hard to read. An HTML-format 10-K is more pleasing to the eye. Fortunately, the SEC also provides the paths (namely, URLs) to HTML-format filings. With the path, we can open an HTML-format filing in a web browser, or download the filing as a PDF.
The Python code again has two parts. In the first part, we need to download the path data. Instead of using master.idx as in the above two posts, we need to use crawler.idx for this task. The path we get will be a URL like this:
https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm
Note that this path is a URL to an index page, not to the HTML-format 10-Q in this example. To get the direct URL to the HTML-format 10-Q, we have to go one level deeper. The second part of the Python code goes that level deeper and extracts the direct URL to the main body of the form (the URL embedded in the first row in more than 99% of cases). The code also extracts information such as the filing date and the period of report from the index page, and writes the output (including filing date, period of report, and direct URL) to log.csv.
The following is an output example: the first URL is the path we get in the first part of the code; the second URL is the direct URL to the HTML-format form.
13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm,2017-10-11 23:44:42,2017-10-11 23:44:50,2014-05-05,2014-03-31,https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm,2017-10-11 23:44:51,2017-10-11 23:44:58,2016-11-03,2016-09-30,https://www.sec.gov/Archives/edgar/data/5272/000000527216000052/maindocument001.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm,2017-10-11 23:44:59,2017-10-11 23:45:05,2007-05-10,2007-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012307007211/y32085e10vq.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm,2017-10-11 23:45:07,2017-10-11 23:45:14,2006-05-10,2006-03-31,https://www.sec.gov/Archives/edgar/data/5272/000095012306006104/y19465e10vq.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm,2017-10-11 23:45:15,2017-10-11 23:45:20,2011-05-05,2011-03-31,https://www.sec.gov/Archives/edgar/data/5272/000104746911004647/a2203832z10-q.htm
The first part of the code:
# Generate the list of index files archived in EDGAR since start_year (earliest: 1993) until the most recent quarter
import datetime

current_year = datetime.date.today().year
current_quarter = (datetime.date.today().month - 1) // 3 + 1
start_year = 1993
years = list(range(start_year, current_year))
quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
history = [(y, q) for y in years for q in quarters]
for i in range(1, current_quarter + 1):
    history.append((current_year, 'QTR%d' % i))
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]
urls.sort()

# Download index files and write content into SQLite
import sqlite3
import requests

con = sqlite3.connect('edgar_htm_idx.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS idx')
cur.execute('CREATE TABLE idx (conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')

for url in urls:
    lines = requests.get(url).text.splitlines()
    nameloc = lines[7].find('Company Name')
    typeloc = lines[7].find('Form Type')
    cikloc = lines[7].find('CIK')
    dateloc = lines[7].find('Date Filed')
    urlloc = lines[7].find('URL')
    records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(),
                      line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]
    cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

con.commit()
con.close()

# Write SQLite database to Stata
import pandas
from sqlalchemy import create_engine

engine = create_engine('sqlite:///edgar_htm_idx.db')
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_stata('edgar_htm_idx.dta')
The first part of the code generates a dataset with the complete path information of SEC filings for the selected period (in both SQLite and Stata formats). You can then select a sample based on firm, form type, filing date, etc., and feed a CSV file to the second part of the code. The input CSV should look like this:
13780110,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2014,https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm
16212215,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,03nov2016,https://www.sec.gov/Archives/edgar/data/5272/0000005272-16-000052-index.htm
6772655,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2007,https://www.sec.gov/Archives/edgar/data/5272/0000950123-07-007211-index.htm
5671285,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,10may2006,https://www.sec.gov/Archives/edgar/data/5272/0000950123-06-006104-index.htm
10831058,5272,AMERICAN INTERNATIONAL GROUP INC,10-Q,05may2011,https://www.sec.gov/Archives/edgar/data/5272/0001047469-11-004647-index.htm
The second part of the code:
import csv
import random
import time
from selenium import webdriver

with open('log.csv', 'w', newline='') as log:
    logwriter = csv.writer(log)
    with open('sample.csv', newline='') as infile:
        records = csv.reader(infile)
        for r in records:
            log_row = r.copy()
            print('Start fetching URL to', r[2], r[3], 'filed on', r[4], '...')
            start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
            driver = webdriver.Chrome('./chromedriver')
            try:
                driver.get(r[5])
                time.sleep(3 + random.random() * 3)
                filing_date = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[1]/div[2]').text
                period_of_report = driver.find_element_by_xpath('//*[@id="formDiv"]/div[2]/div[2]/div[2]').text
                form_text = driver.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a').text
                form_link = driver.find_element_by_link_text(form_text).get_attribute('href')
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Success!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, filing_date, period_of_report, form_link]
            except:
                end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                print('Error!', start_time, ' --> ', end_time, '\n')
                log_row = log_row + [start_time, end_time, 'ERROR!']
            driver.quit()
            logwriter.writerow(log_row)
Please note:
- Please use Python 3.x.
- Please install all required modules such as Selenium.
- The second part of the code uses Selenium. There are other ways to do the job, e.g., using BeautifulSoup.
- The second part of the code only outputs the direct URL to the HTML-format filing; it does not download the filing.
Hi Kai,
Thank you very much for your sharing. I am new to Python. Your posts really help me a lot. I was able to run the first part of the code. The second part also ran, and the output file is log.csv. As you said, the CSV file contains the direct URL to the HTML-format filing. Do you know what code I can use to get the HTML-format filing directly? To make it clear, how can I save each filing as an HTML-format file automatically? Thanks!
Hi Sara, there are many ways to do this. Just google "python download html" or something similar and you will find solutions (e.g., use the requests or urllib module).
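For example, a minimal sketch with the requests module (not part of the original post; the URL is one of the direct URLs from the sample log.csv output above, and the output filename is just a placeholder):

import requests

# one direct URL taken from the sample log.csv output shown in the post
url = 'https://www.sec.gov/Archives/edgar/data/5272/000000527214000007/maindocument001.htm'
response = requests.get(url)

# save the page content as a local HTML file
with open('filing.htm', 'w', encoding='utf-8') as f:
    f.write(response.text)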
Kai,
Thank you for sharing your code. I am very new to Python. I am learning Python while trying to do some data scraping from websites (SEC, etc.). I could run the first part without any issue. However, I get error messages when I try to run the second part; would you mind checking where it might have gone wrong?
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
73 close_fds=platform.system() != 'Windows',
---> 74 stdout=self.log_file, stderr=self.log_file)
75 except TypeError:
~\AppData\Local\Continuum\anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
708 errread, errwrite,
--> 709 restore_signals, start_new_session)
710 except:
~\AppData\Local\Continuum\anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
996 os.fspath(cwd) if cwd is not None else None,
--> 997 startupinfo)
998 finally:
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
WebDriverException Traceback (most recent call last)
in ()
16 start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
17
---> 18 driver = webdriver.Chrome('./chromedriver')
19
20 try:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py in __init__(self, executable_path, port, chrome_options, service_args, desired_capabilities, service_log_path)
60 service_args=service_args,
61 log_path=service_log_path)
---> 62 self.service.start()
63
64 try:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
79 raise WebDriverException(
80 "'%s' executable needs to be in PATH. %s" % (
---> 81 os.path.basename(self.path), self.start_error_message)
82 )
83 elif err.errno == errno.EACCES:
WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
Looks like you did not download chromedriver and specify the correct path to it in the code. Try downloading it at https://sites.google.com/a/chromium.org/chromedriver/home and save it in the same directory as the code.
Kai, thank you so much! It works now.
Hey Kai,
thanks for your great tutorial! It works just fine. But I have a question: sometimes the loop "for url in urls:" in part 1 takes very long (or does not finish at all; I didn't let it run for a whole night), for example for 2017Q3 or 2011Q4 and some others. Getting all quarterly filings for 2015-2016 works fast and seems to be complete.
Do you have any idea why this is happening?
Best regards!
Hi, thanks for your feedback. It's an interesting question. I did a quick test on 2017Q3. It turns out that this line, "lines = requests.get(url).text.splitlines()", is what drags the execution down. I don't know why, but it appears to be related to the idx file itself.
Replace it with:
r = requests.get(url)
r.encoding = 'ascii'
lines = r.text.splitlines()
Hi. Kai
Thank you so much for sharing your code.
I have a question about getting the SIC code. I opened crawler.idx, but I cannot find the SIC code.
How can I get the SIC code? I inserted sicloc = lines[7].find('SIC'), but it does not give the SIC code.
Thanks in advance
There is no SIC in crawler.idx. You can link CIK to SIC using Compustat. In my sample log.csv, the first number is an index number generated by SQLite, and the second number is the CIK (not the SIC).
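In case it helps, a rough sketch of that CIK-to-SIC link with pandas (the filenames and column names below are hypothetical; adjust them to your own Compustat extract):

import pandas as pd

# hypothetical files: an export of the idx table and a Compustat extract with cik/sic columns
idx = pd.read_csv('edgar_htm_idx.csv', dtype={'cik': str})
comp = pd.read_csv('compustat_cik_sic.csv', dtype={'cik': str})

# CIK formats can differ across sources (e.g., leading zeros), so normalize before merging
idx['cik'] = idx['cik'].str.lstrip('0')
comp['cik'] = comp['cik'].str.lstrip('0')

merged = idx.merge(comp[['cik', 'sic']].drop_duplicates(), on='cik', how='left')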
Hi Kai, Thanks for the code.
I am getting an exception with ChromeDriver. Would you be able to help?
WebDriverException: chrome not reachable
(Driver info: chromedriver=2.34.522940 (1a76f96f66e3ca7b8e57d503b4dd3bccfba87af1),platform=Windows NT 6.1.7601 SP1 x86_64)
You need to install ChromeDriver separately. Please refer to the technical documentation at https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver
Hello Kai,
Thank you very much for sharing this information. This is very helpful.
I have one question about running the last part of the first-part code. Specifically, I get an error message when trying to convert the downloads from SQLite to Stata.
Error message:
File "C:\Users\user\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\stata.py", line 1838, in _dtype_to_default_stata_fmt
itemsize = max_len_string_array(_ensure_object(column.values))
SystemError: returned a result with an error set
When I follow stata.py in pandas, it seems to trip over line 1839:
if itemsize > 244:
raise ValueError(excessive_string_length_error % column.name)
Would you be able to share your thoughts on this problem? Thanks again for your help!
Hello, Kai
I ran the first part and for some reason it gets stuck running the 4th QTR of year 2011.
I left it for over 2 hours and it continues to run.
Do you happen to know what the problem could be?
me too!!
I cannot use requests, so I changed it to urllib.request.urlopen(url) with decode('utf-8').
However, for 2011 Q3, I got the error 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 11344121: invalid continuation byte'.
Do you have any idea?
"gaierror: [Errno 11001] getaddrinfo failed"
How do I solve this? I am getting this error with the following code:
ftp = ftplib.FTP('ftp.sec.gov')
ftp.login()
daily_files = [f for f in ftp.nlst('edgar/daily-index') if
f.startswith('edgar/daily-index/master')]
daily_files.sort()
SEC has phased out the ftp server. Scroll down to the very end of my post where I have an update.
Hi Kai,
Great post! Thank you very much for posting the code.
One minor suggestion: instead of using Chrome as the driver, how about using Firefox? The benefit is that we could run it "headless" and not open the browser each time. What do you think?
Kind regards,
Tian
You can take a look at the Selenium documentation, which includes headless solutions.
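For example, a minimal headless sketch with Chrome (assuming the same ChromeDriver setup as in the post; depending on your Selenium version the keyword argument is chrome_options or options):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome without opening a browser window for each request
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome('./chromedriver', chrome_options=options)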
This is so helpful – Thank you so much for sharing!
Hi Kai,
This is really helpful. Thanks. I was able to run the code to create the database. It takes about 50 minutes. How do I update the database without re-running the code from the beginning?
Let's say we now have data from 1993 to 2019Q3. In January 2020, I want to update the 2019 Q3 data (since 2019 Q3 had not ended yet, my data may not be complete) and then add the 2019 Q4 data.
How do I pull down only the 2019 Q3 and Q4 data and add them to the original database, and also delete the duplicate data for 2019 Q3 (which I generated before)?
Thanks a lot!
Hi Allen,
I just downloaded the 2019’s data. I may be able to answer your question.
Since there are only two quarters, you can manually override the URL generation code (everything up to the urls list) with the following:
history = [(2019, 'QTR3'), (2019, 'QTR4')]
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]
Merging with what you have is easy. If you work with Stata, you can easily merge the two datasets and delete the duplicates.
If you prefer CSV, you can convert the SQLite file to CSV by modifying the last few lines of code:
with engine.connect() as conn, conn.begin():
    data = pandas.read_sql_table('idx', conn)
    data.to_csv('edgar_idx_2019_20190801.csv', index=False)
To merge with what you have:
import pandas as pd
# Load the datasets
df1 = pd.read_csv(r"old_file.csv")
# if your old file is in Stata format, use: df1 = pd.read_stata(r"old_file.dta")
df2 = pd.read_csv(r"new_file.csv")
# Merge
df1 = pd.concat([df1, df2], ignore_index=True)
# Drop duplicates
df1 = df1.drop_duplicates()
# Output the dataset
df1.to_csv(r"merged.csv", index=False)
Hi Victor,
Thanks for your help, this is great. I do not use Stata, but I figured out a different way to organize the database. First, run 1993 to 2018Q4 as the base database; then, going forward, run only the current calendar year and merge it back into the base database. Duplicates won't be an issue each time I rerun the calendar year, and rerunning only the calendar year does not take long.
Thanks for your response anyway. It's helpful as well 🙂
Hi there: when is it that I get an "Error!" message instead of "Success!"?
Hi Kai,
I ran your code today and it doesn't seem to get past 2011 Q3. Is this a problem with EDGAR?
Thank you!
Hi Lucy, please see the update at the beginning of my post.
Hi Kai,
Thank you so much!
Your posts are of great help to me.
I used your code to get the HTML format of SEC filings. But since I need more than 30,000 links, it is getting slower and slower. Is it because the SEC has some anti-scraping measures?
How can I expedite the process?
If possible, do you provide the full HTML-format SEC filings?
Many thanks!
Hi Kai!
Thanks for the posting!!
I think using Selenium here is really overkill. Noticing that there is only one "htm" document in each index link, we could just use BeautifulSoup (like you suggested in this post) to pinpoint the only a-tag that contains the href link to the "htm", which is what we want.
Lemme know what you think.
Yes, I agree. BeautifulSoup is more elegant for this task.
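For readers who want to try it, a minimal sketch of that BeautifulSoup approach (assuming the index-page layout at the time of writing, where the document table has class "tableFile"; the URL below is one of the sample index pages shown earlier in the post):

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/5272/0000005272-14-000007-index.htm'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# the first .htm link in the document table points to the main body of the form
table = soup.find('table', class_='tableFile')
for a in table.find_all('a'):
    href = a.get('href', '')
    if href.endswith('.htm'):
        form_link = 'https://www.sec.gov' + href
        break

print(form_link)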
Hi Kai,
Thanks for the post. I'm working on a text analysis project on EDGAR and came across your website.
It is very helpful!
I did some simple word counts using both the txt link and the html link, and I find different results for all companies/files I searched. For example,
https://www.sec.gov/Archives/edgar/data/1591890/0001493152-19-003648.text
is the text file of a 10-K, in which the word "anticipate" appears 9 times and "risk" 86 times.
However, when I use the html version, https://www.sec.gov/Archives/edgar/data/1591890/000149315219003648/form10-k.htm, I get "anticipate" 25 times and "risk" 142 times.
Is there any difference between these two versions of the same file? I understand that you are a CS expert, not a finance person. Please let me know if this has happened to you before. Thanks a lot!
Best,
Jiadi
I found 39 instances of "anticipate" in the txt and 26 in the html. If you compare the hits in both versions, you will notice that the first 26 hits are exactly the same. The extra hits in the txt result from the fact that the txt document also includes XML tagging content and thus generates repetition. The txt format has become notoriously unfriendly nowadays, and that's why I would advocate directing our research efforts to the html format.
Hi Kai,
Thanks a lot for the clarification.
I just checked, and your numbers are correct!
Hi Kai,
Thanks very much for the post! I am a green hand in this area, and the code helped me a lot, especially when trying to extract information from the SEC. I can successfully generate a 2.76GB edgar_htm_idx.db file, but when I run the last part of the first-part code, I get the following error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
in ()
5 with engine.connect() as conn, conn.begin():
6 data = pandas.read_sql_table('idx', conn)
---> 7 data.to_stata('edgar_htm_idx.dta')
~/Documents/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label, variable_labels, version, convert_strl)
1875 write_index=write_index,
1876 variable_labels=variable_labels, **kwargs)
---> 1877 writer.write_file()
1878
1879 def to_feather(self, fname):
~/Documents/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in write_file(self)
2215 self._write_characteristics()
2216 self._prepare_data()
---> 2217 self._write_data()
2218 self._write_strls()
2219 self._write_value_labels()
~/Documents/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _write_data(self)
2407 def _write_data(self):
2408 data = self.data
---> 2409 self._file.write(data.tobytes())
2410
2411 def _null_terminate(self, s, as_string=False):
OSError: [Errno 22] Invalid argument
I am running it on a MacBook Pro with 16GB RAM. I guess it might be related to the size of the Stata file it tries to write? Do you have any idea why this happens and how to fix it? Thanks!
Best,
Lucy
Hi Kai,
Thanks for your code. I can get results from the first-part code.
However, I fail to get output from the second-part code. I get errors in the log.csv file, and there is no deeper HTML information when I execute the second-part code.
Start fetching URL to 1414953 28/10/2019 filed on https://www.sec.gov/Archives/edgar/data/1414953/0001607062-19-000411-index.htm …
Error! 2019-12-15 23:11:23 --> 2019-12-15 23:11:25
Start fetching URL to 1414953 28/10/2019 filed on https://www.sec.gov/Archives/edgar/data/1414953/0001607062-19-000411-index.htm …
Error! 2019-12-15 23:11:23 --> 2019-12-15 23:11:25
Hi Kai,
Your code is all I needed, thanks a lot!!!
As others mentioned, there are two problems, in years 2011 and 2017, that break the program, but as you mentioned, they are related to the EDGAR files.
I used your code and made some changes to get the information from another EDGAR source that has the same information, but instead of one idx file for each quarter, there is one idx file for each day of the quarter. It takes a little more time to process but avoids the problems in the 2017 year.
I also changed the starting year to a starting date, so anyone can update the SQLite table from the last time the table was updated.
I am only interested in having the information in SQLite, so the code finishes there.
I hope this code can help others who want to get all the EDGAR history files.
Thanks again and good luck!
Eduardo
********************************************************************
import datetime

# You can choose the date you want to start from
start_date = "2020-02-19"
start_year = int(start_date[:4])
current_year = datetime.date.today().year
years = list(range(start_year + 1, current_year))
start_quarter = (int(start_date[5:7]) - 1) // 3 + 1
history = []
year = current_year - start_year
# if the start year differs from the current year, take the remaining quarters of the start year
if year > 0:
    for i in range(start_quarter, 5):
        history.append((start_year, 'QTR%d' % i))
# if there are years between the start and current years, take all the quarters of those years
if year > 1:
    quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
    history.extend([(y, q) for y in years for q in quarters])
current_quarter = (datetime.date.today().month - 1) // 3 + 1
# finally, add the current year's quarters
for i in range(1, current_quarter + 1):
    history.extend([(current_year, 'QTR%d' % i)])
# now history has all quarters from the starting quarter to the current quarter
base_url = "https://www.sec.gov/Archives/edgar/daily-index/"
QTR = "%d/%s/"
urls = [base_url + QTR % (x[0], x[1]) for x in history]

# Download index files and write content into SQLite
import sqlite3
import requests
import urllib

con = sqlite3.connect('edgar_htm_idx.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS idx (conm TEXT, type TEXT, cik TEXT, date TEXT, path TEXT)')

for url in urls:
    url1 = url + "index.json"
    print(url1)
    file_content = requests.get(url1)
    decoded_content = file_content.json()
    print('-' * 68)
    print('Pulling files')
    # for each file in the directory items list, keep only the daily crawler files
    # AGAIN DOING A SUBSET
    start_file_date = datetime.datetime(2020, 2, 19)
    for file in decoded_content['directory']['item']:
        if file["name"][0:7] == "crawler":
            date = file["name"][8:16]
            file_date = datetime.datetime(int(file["name"][8:12]), int(file["name"][12:14]), int(file["name"][14:16]))
            if file_date >= start_file_date:
                file_url = url + file['name']
                date = file["name"][8:16]
                print("Processing working day:", date, "in Edgar File:", url[47:56])
                lines = requests.get(file_url).text.splitlines()
                typeloc = lines[7].find('Form Type')
                cikloc = lines[7].find('CIK')
                dateloc = cikloc + 7
                urlloc = cikloc + 20
                records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(),
                                  line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]
                cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
    print(url, 'downloaded and wrote to SQLite')

print("All files recorded in Database")
con.commit()
con.close()
Thank you Eduardo, for sharing the code. It’s very useful.
Hey, Kai. Can I use the URLs generated by this code to get the metadata of the values in the financial statement tables, such as cash flows, balance sheets, etc.?
Hi Kai, I'm having issues using the "second part" of the code. I'm not sure how to get that code to produce a CSV with a URL for a company and the time period I want.
Any suggestions or examples that you have would be super appreciated. I've spent several hours trying to troubleshoot this myself to no avail.
Hey, did you have any luck figuring this out? I have the same problem.
Hello Kai, I am working on data in annual reports. Could you show how to extract the URLs? I am wondering about the code without the "quarter". I would very much appreciate your help!
Hi Kai,
The original post links (Part I and II) do not work.
[Original post] I wrote two posts to describe how to download TXT-format SEC filings on EDGAR:
Use Python to download TXT-format SEC filings on EDGAR (Part I)
Use Python to download TXT-format SEC filings on EDGAR (Part II)
Could you help?
Thanks a lot.
Thanks for spotting the broken links. Corrected.
Hi Kai,
Thank you for sharing your code. I have to say your code helped me generate several projects that led to good publications. Thank you!
I am working on a new project that needs some updated SEC filings from 2020. After I run the first part, the Stata file output doesn't seem to provide the correct data as before. There are some weird special characters in the Stata file instead of the usual name, cik, path, etc. I am not sure if it is because of updates on the SEC side. Would you mind taking a look at the issue? Thank you.
I am having the same issue here. Can you please check for us?
Cheers
Evans
Hi Kai and Gram,
I also have some weird outcomes in the Stata file, including a message saying that requests should not exceed 10 per second. I googled and figured out that's a new rate-control rule the SEC applies. Check here: https://www.sec.gov/oit/announcement/new-rate-control-limits
Do you have any solutions to get around this issue? Thank you so much for what you do and contribute to the academic community!
Hi,
Had the same issue, and found a workaround here (https://gist.github.com/dougvk/8499335):
You need to declare your user agent per SEC guidelines.
# add this line
headers = {"user-agent": "Safari"}
Then add headers to both of the following lines in Eduardo's code:
file_content = requests.get(url1, headers=headers)
lines = requests.get(file_url, headers=headers).text.splitlines()
Dear,
Many thanks for sharing this helpful information! In a similar vein, I was also able to edit Kai's original code (the requests line) to obtain results similar to Eduardo's. This helps me a lot!
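For reference, a minimal sketch of that edit applied to the first-part code above (the user-agent value is just a placeholder; the SEC asks for a descriptive string such as a name and contact email):

headers = {'user-agent': 'Sample Company admin@example.com'}
for url in urls:
    lines = requests.get(url, headers=headers).text.splitlines()
    # ... the rest of the loop stays unchanged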
Hi Kai, thanks a lot for sharing the code! As you mentioned, htm-format filings are easier to parse. However, there seem to be two situations where htm-format filings may not work: (1) older filings do not have the htm format (e.g., Starbucks has txt-format filings for its 2000 10-K: https://www.sec.gov/Archives/edgar/data/0000829224/000089102000002203/0000891020-00-002203-index.htm), so it seems that I couldn't get the html URL in this case; (2) some firms report financial statements and other important information in exhibits, and these exhibits may or may not be in the htm format (e.g., Starbucks has the htm format for the main body of its 2002 10-K but the txt format for its exhibits).
For those firm-years that do not use the htm-format filings, we would still need to use the txt-format filings, right?
Given the various formats of the filings, I feel that the best (most accurate) way is to download all txt-format filings and keep all documents that have the type we want (e.g., "10-K" and those starting with "Ex-"). What do you think? Thank you!