Use Python to extract URLs to HTML-format SEC filings on EDGAR

[Update on 2019-08-07] From time to time, readers report that the first-part code seemingly stops at certain quarters. I don't know the exact reason (perhaps a server-side issue); I have never encountered the problem myself. I would suggest simply trying again later. I also share a Dropbox link from which you can download the first-part results (as of 2019-08-07; 2.4 GB) in CSV format (link). Please note, as I explain in my original post, that the URL contained in the downloadable CSV is not the URL to the HTML-format filing; it is only the URL to an index page. You still need to select your sample and run the second-part code to get the URL to the HTML-format filing.

[Original post] I previously wrote two posts describing how to download TXT-format SEC filings on EDGAR.

Although TXT-format files are easy to process further, they are often poorly formatted and therefore hard to read. An HTML-format 10-K is much easier on the eyes. Fortunately, the SEC also provides the paths (i.e., URLs) to HTML-format filings. With such a path, we can open an HTML-format filing in a web browser, or download the filing as a PDF.

The Python code has two parts. In the first part, we download the path data. Instead of using master.idx as in the two posts above, we use crawler.idx for this task. The path we get will be a URL like this:

https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm

Note that this path is a URL to an index page, not a URL to the HTML-format 10-Q in this example. To get the direct URL to the HTML-format 10-Q, we have to go one level deeper. The second part of the Python code goes that level deeper and extracts the direct URL to the main body of the form (in more than 99% of cases, the URL embedded in the first row of the document table). The code also extracts information such as the filing date and period of report from the index page, and writes the output (filing date, period of report, and direct URL) to log.csv. The following is an output example: the first URL is the path we get in the first part of the code; the second URL is the direct URL to the HTML-format form.

The first part of the code:
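
In outline, the first part does something like the following (a simplified sketch, not the full code; the output file names are placeholders, and the header-row index in crawler.idx may need adjusting):

import requests
import pandas as pd
from sqlalchemy import create_engine

# Build the list of crawler.idx URLs for the selected period
history = [(year, 'QTR%d' % qtr) for year in range(2015, 2017) for qtr in range(1, 5)]
urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx'
        % (year, qtr) for (year, qtr) in history]

engine = create_engine('sqlite:///edgar_idx.db')
records = []
for url in urls:
    lines = requests.get(url).text.splitlines()
    # The header row gives the character position of each column
    header = lines[7]
    typeloc = header.find('Form Type')
    cikloc = header.find('CIK')
    dateloc = header.find('Date Filed')
    urlloc = header.find('URL')
    for line in lines[9:]:   # data rows start after the dashed separator line
        records.append({'conm': line[:typeloc].strip(),
                        'type': line[typeloc:cikloc].strip(),
                        'cik': line[cikloc:dateloc].strip(),
                        'date': line[dateloc:urlloc].strip(),
                        'url': line[urlloc:].strip()})

idx = pd.DataFrame(records)
idx.to_sql('idx', engine, if_exists='replace', index=False)   # SQLite output
idx.to_stata('edgar_idx.dta', write_index=False)              # Stata output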

The first part of the code generates a dataset of the complete path information of SEC filings for the selected period (in both SQLite and Stata formats). You can then select a sample based on firm, form type, filing date, etc., and feed a CSV file to the second part of the code. The input CSV should look like this:
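
For example (illustrative only; the exact columns are up to you, as long as the file includes the index-page URL column from the first part):

cik,type,url
859747,10-Q,https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm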

The second part of the code:
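
Again in outline, the second part does something like this (a simplified sketch; the input column name 'url', the XPath selectors, and the log.csv layout below are illustrative rather than exact):

import csv
import pandas as pd
from selenium import webdriver

sample = pd.read_csv('sample.csv')           # the CSV produced by your sample selection
driver = webdriver.Chrome('./chromedriver')  # chromedriver must be on PATH or in the working directory

with open('log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['index_url', 'filing_date', 'period_of_report', 'doc_url'])
    for index_url in sample['url']:
        driver.get(index_url)
        filing_date = driver.find_element_by_xpath(
            "//div[contains(text(), 'Filing Date')]/following-sibling::div").text
        period = driver.find_element_by_xpath(
            "//div[contains(text(), 'Period of Report')]/following-sibling::div").text
        # The main body of the form is usually the link in the first data row of
        # the "Document Format Files" table on the index page
        doc_url = driver.find_element_by_xpath(
            "//table[@summary='Document Format Files']//tr[2]//a").get_attribute('href')
        writer.writerow([index_url, filing_date, period, doc_url])

driver.quit()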

Please note:

  1. Please use Python 3.x.
  2. Please install all required modules such as Selenium.
  3. The second part of the code uses Selenium. There are other ways to do the job, e.g., using BeautifulSoup (a rough sketch follows this list).
  4. The second part of the code only outputs the direct URL to the HTML-format filing; it does not download the filing.
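
For example, a rough requests + BeautifulSoup version of the index-page step might look like this (the table selector reflects the current EDGAR index-page layout and may change):

import requests
from bs4 import BeautifulSoup

def direct_url(index_url):
    """Return the URL of the main document listed on an EDGAR filing index page."""
    soup = BeautifulSoup(requests.get(index_url).text, 'html.parser')
    table = soup.find('table', summary='Document Format Files')
    first_link = table.find_all('tr')[1].find('a')   # first data row holds the main document
    return 'https://www.sec.gov' + first_link['href']

print(direct_url('https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm'))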

26 Responses to Use Python to extract URLs to HTML-format SEC filings on EDGAR

  1. sara says:

    Hi Kai,
    Thank you very much for sharing. I am new to Python, and your posts really help me a lot. I was able to run the first part of the code. The second part also ran, and the output file is log.csv. As you said, the CSV file contains the direct URL to the HTML-format filing. Do you know what code I can use to get the HTML-format filing itself? To make it clear, how can I save each filing as an HTML file automatically? Thanks!

    • Kai Chen says:

      Hi Sara, there are many ways to do this. Just google "python download html" or something similar and you will find solutions (e.g., using the requests or urllib module).
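
      For example, a minimal sketch with requests (the column name 'doc_url' and the file naming are just placeholders, not from my code):

      import requests
      import pandas as pd

      log = pd.read_csv('log.csv')
      for i, doc_url in enumerate(log['doc_url']):
          page = requests.get(doc_url)
          # save each filing as a separate HTML file
          with open('filing_%d.htm' % i, 'wb') as f:
              f.write(page.content)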

  2. Sylvia Li says:

    Kai,
    Thank you for sharing your code. I am very new to Python. I am learning Python while trying to do some data scraping from websites (SEC, etc.). I could run the first part without any issue. However, I get error messages when I try to run the second part. Would you mind checking what might have gone wrong?

    ---------------------------------------------------------------------------
    FileNotFoundError Traceback (most recent call last)
    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
    73 close_fds=platform.system() != 'Windows',
    ---> 74 stdout=self.log_file, stderr=self.log_file)
    75 except TypeError:

    ~\AppData\Local\Continuum\anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    708 errread, errwrite,
    ---> 709 restore_signals, start_new_session)
    710 except:

    ~\AppData\Local\Continuum\anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    996 os.fspath(cwd) if cwd is not None else None,
    ---> 997 startupinfo)
    998 finally:

    FileNotFoundError: [WinError 2] The system cannot find the file specified

    During handling of the above exception, another exception occurred:

    WebDriverException Traceback (most recent call last)
    in ()
    16 start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    17
    ---> 18 driver = webdriver.Chrome('./chromedriver')
    19
    20 try:

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py in __init__(self, executable_path, port, chrome_options, service_args, desired_capabilities, service_log_path)
    60 service_args=service_args,
    61 log_path=service_log_path)
    ---> 62 self.service.start()
    63
    64 try:

    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
    79 raise WebDriverException(
    80 "'%s' executable needs to be in PATH. %s" % (
    ---> 81 os.path.basename(self.path), self.start_error_message)
    82 )
    83 elif err.errno == errno.EACCES:

    WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

  3. Bob says:

    Hey Kai,

    thanks for your great tutorial! It works just fine. But I have a question: sometimes the loop "for url in urls:" in part 1 takes very long (or does not finish at all; I didn't let it run for a whole night), for example for 2017Q3 or 2011Q4 and some others. Getting all quarterly filings for 2015-2016 works fast and seems to be complete.

    Do you have any idea why this is happening?

    Best regards!

    • Kai Chen says:

      Hi, thanks for your feedback. It's an interesting question. I did a quick test on 2017Q3. It turns out that this line, "lines = requests.get(url).text.splitlines()", is what drags the execution down. I don't know why, but it appears to be related to the idx file itself.
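
      If it hangs on a particular quarter, one thing to try (just a sketch, not part of my original code) is adding a timeout and a few retries around that line:

      import requests

      def get_idx_lines(url, tries=3):
          for _ in range(tries):
              try:
                  r = requests.get(url, timeout=60)
                  r.raise_for_status()
                  return r.text.splitlines()
              except requests.RequestException:
                  continue   # try again
          raise RuntimeError('Could not download ' + url)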

  4. philip says:

    Hi Kai,
    Thank you so much for sharing your code.
    I have a question about getting the SIC. I opened crawler.idx, but I cannot find the SIC code.
    How can I get the SIC code? I inserted "sicloc = lines[7].find('SIC')", but it does not give the SIC code.
    Thanks in advance.

    • Kai Chen says:

      There is no SIC in crawler.idx. You can link CIK to SIC using Compustat. In my sample log.csv, the first number is an index number generated by SQLite, and the second number is the CIK (not the SIC).

  5. annamalai says:

    Hi Kai, Thanks for the code.

    I am getting an exception with ChromeDriver. Would you be able to help?

    WebDriverException: chrome not reachable
    (Driver info: chromedriver=2.34.522940 (1a76f96f66e3ca7b8e57d503b4dd3bccfba87af1),platform=Windows NT 6.1.7601 SP1 x86_64)

  6. Kim says:

    Hello Kai,

    Thank you very much for sharing this information. This is very helpful.

    I have one question while running the last part of the first code. Specifically, I get an error message when trying to convert the downloads from SQLite to Stata.

    Error message:
    File “C:\Users\user\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\stata.py”, line 1838, in _dtype_to_default_stata_fmt
    itemsize = max_len_string_array(_ensure_object(column.values))

    SystemError: returned a result with an error set

    When I follow stata.py in pandas, it seems to trip over line 1839:

    if itemsize > 244:
        raise ValueError(excessive_string_length_error % column.name)

    Would you be able to share your thoughts on this problem? Thanks again for your help!

  7. joe says:

    Hello Kai,
    I ran the first part and for some reason it gets stuck on the 4th quarter of 2011.
    I left it for over 2 hours and it was still running.
    Do you happen to know what the problem could be?

    • Emma says:

      me too!!
      I cannot use requests, so I replaced it with urllib.request.urlopen(url) and decode('utf-8').

      However, for 2011 Q3 I got an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 11344121: invalid continuation byte.

      do you have any idea?

  8. ANADRAYA PAI M says:

    I am getting this error in the code below: "gaierror: [Errno 11001] getaddrinfo failed".
    How can I solve it?

    ftp = ftplib.FTP('ftp.sec.gov')
    ftp.login()
    daily_files = [f for f in ftp.nlst('edgar/daily-index') if
                   f.startswith('edgar/daily-index/master')]
    daily_files.sort()

  9. Tian says:

    Hi Kai,

    Great post! Thank you very much for posting the code.

    One minor suggestion: instead of using Chrome as the driver, how about using Firefox? The benefit is that we could run it "headless" and not open a browser window each time. What do you think?
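
    For example, something along these lines (just a sketch; the options API differs a bit across Selenium and browser versions):

    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.headless = True               # run the browser without opening a window
    driver = webdriver.Firefox(options=options)

    # Chrome can also run headless:
    # options = webdriver.ChromeOptions()
    # options.add_argument('--headless')
    # driver = webdriver.Chrome('./chromedriver', chrome_options=options)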

    Kind regards,
    Tian

  10. Erika says:

    This is so helpful – Thank you so much for sharing!

  11. Allen says:

    Hi Kai,

    This is really helpful, thanks. I was able to run the code to create the database. It takes about 50 minutes. How do I update the database without re-running the code from the beginning?

    Let's say we now have data from 1993 to 2019 Q3. In January 2020, I want to update the 2019 Q3 data (since 2019 Q3 had not ended when I downloaded it, my data may not be complete) and then add the 2019 Q4 data.

    How do I pull down only the 2019 Q3 and Q4 data and add them to the original database? Also, how do I delete the duplicate 2019 Q3 data (which I generated before)?

    Thanks a lot!

    • Victor says:

      Hi Allen,

      I just downloaded the 2019 data. I may be able to answer your question.

      Since there are only two quarters, you can manually override the url generation code, up to line 13, with the following:

      history = [(2019, "QTR3"), (2019, "QTR4")]
      urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/crawler.idx' % (x[0], x[1]) for x in history]

      Merging with what you have is easy. If you work with Stata, you can easily merge the two datasets and delete the duplicates.

      If you prefer CSV, you can convert the SQLite table to CSV by modifying the last few lines of the code:

      with engine.connect() as conn, conn.begin():
          data = pandas.read_sql_table('idx', conn)
      data.to_csv('edgar_idx_2019_20190801.csv', index=False)

      To merge with what you have:

      import pandas as pd

      # Load the datasets
      df1 = pd.read_csv(r"old_file.csv")
      # if your old file is in Stata format: df1 = pd.read_stata(r"old_file.dta")
      df2 = pd.read_csv(r"new_file.csv")

      # Merge
      df1 = pd.concat([df1, df2], ignore_index=True)
      # Drop duplicates
      df1 = df1.drop_duplicates()

      # Output the dataset
      df1.to_csv(r"merged.csv", index=False)

      • Allen says:

        Hi Victor,

        Thanks for your help, this is great. I do not use Stata, but I figured out a different way to organize the database: first, run 1993 to 2018 Q4 as the base database; then, going forward, run only the current calendar year and merge it back into the base database. Duplicates won't be an issue each time I rerun the calendar year, and rerunning only the calendar year does not take long.

        Thanks for your response anyway. It's helpful as well 🙂

  12. Lucy says:

    Hi there: when is it that I get an "Error!" message instead of "Success!"?

  13. Lucy says:

    Hi Kai,

    I ran your code today and it doesn't seem to get past 2011 Q3. Is this a problem with EDGAR?

    Thank you!

  14. Dara says:

    Hi Kai,

    Thank you so much!
    Your posts are of great help to me.

    I used your code to get the HTML format of SEC filings. But since I need more than 30,000 links, it is getting slower and slower. Is it because the SEC has some anti-scraping measure?
    How can I expedite the process?

    If possible, do you provide the full HTML format of SEC filings?

    Many thanks!
