Update on October 6, 2018: As I acknowledged in the very first edition of this post, I borrowed some code from Edouard Swiac’s Python module “python-edgar” (version 1.0). Edouard kindly informed me that he had updated his module (see his GitHub page). The major updates include: (1) migrating the file download from FTP to HTTPS, and (2) adding parallel downloads, so it is now faster to rebuild the full index, especially when going all the way back to 1993. My initial impression of his updated module is that it provides more flexibility and should be more robust than mine. Thank you, Edouard, for your work!

Update on March 3, 2017: SEC closed the FTP server permanently on December 30, 2016 and started to use a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here is the new script for Part I.

The technical details may be too boring for most people, so I provide multiple downloadable Stata datasets that include all index files from 1993 Q1 to October 6, 2018.

Stata format (1993–2000); Stata format (2001–2005); Stata format (2006–2010); Stata format (2011–2015); Stata format (2016–2019/03/16)

Original Post: We know that SEC makes company filings (e.g., 10-Ks, 10-Qs and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. SEC provides an anonymous EDGAR FTP server to access raw text filings (Update on March 3, 2017: SEC closed the FTP server permanently on December 30, 2016, so the description of the FTP server might not be applicable thereafter; but the basic idea about the URLs to raw text filings remains unchanged). Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or an FTP client to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:

ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx

If we copy the path into an Internet browser or an FTP client, we can download the file directly.
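Since the FTP shutdown (see the updates above), the same directory tree is served over HTTPS under https://www.sec.gov/Archives/, so any saved FTP path can be mapped mechanically. A one-line sketch (the helper name is made up):

```python
# Map an old EDGAR FTP path to its HTTPS successor. The two trees share
# the same tail under edgar/; only the scheme and host prefix change.
def ftp_to_https(ftp_path):
    return ftp_path.replace("ftp://ftp.sec.gov/",
                            "https://www.sec.gov/Archives/")

print(ftp_to_https("ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx"))
```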

In the above example, we can find the path to “master.idx” by navigating the EDGAR FTP server. But we cannot find any path to any raw text filing. In other words, paths to raw text filings are not visible by simply browsing the EDGAR FTP server. SEC purposely hides paths to raw text filings to reduce server load and avoid data abuse.

To bulk download raw text filings, we therefore have to complete two steps:

1. Find paths to raw text filings;
2. Select what we want and bulk download the raw text filings from the EDGAR FTP server using the paths obtained in the first step.

This post describes the first step, and I elaborate on the second step in another post.

SEC stores all path information in index files. See the technical details here. Let’s take a snapshot of an index file:

The last field on each line in the main body of the index file shows the path to an actual raw text filing. What we have to do in the first step is download and parse all index files and write their contents into a database. Then, in the second step, we can run any query against the database (e.g., select a certain form type or a certain period of time) and download the raw text filings using the selected paths.
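In miniature, this first step looks like the following (a simplified sketch, not the full program; the field order CIK|Company Name|Form Type|Date Filed|Filename follows the master.idx layout, and the table name idx is just an example). It runs on a two-line sample instead of a downloaded index file:

```python
import sqlite3

# Two sample records in master.idx pipe-delimited format.
sample = """1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|edgar/data/1000045/0001193125-15-368449.txt
1000180|SANDISK CORP|8-K|2015-10-22|edgar/data/1000180/0001193125-15-350703.txt"""

# Parse each line into a 5-field tuple.
records = [tuple(line.split("|")) for line in sample.splitlines()]

# Store the records in a SQLite table, then query by form type.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE idx (cik TEXT, conm TEXT, type TEXT, date TEXT, path TEXT)")
cur.executemany("INSERT INTO idx VALUES (?, ?, ?, ?, ?)", records)
conn.commit()

rows = cur.execute("SELECT path FROM idx WHERE type = '10-Q'").fetchall()
print(rows)  # paths for 10-Q filings only
```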

I wrote the following Python program to execute the first step. This program borrows some code from Edouard Swiac’s Python module “python-edgar” (version 1.0). Please see his package information page here.

Please note: my program stores all paths in an SQLite database. I like this lightweight database product very much. The last few lines of my program transfer the data from the SQLite database to a Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules, pandas and SQLAlchemy, which you have to install on your own using pip. Please consult the documentation of SQLite, pandas, and SQLAlchemy if you have installation problems. I am using Python 3.x in all my Python posts.
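For readers who only care about the export step, here is a minimal, self-contained sketch (table and column names are illustrative). It uses a plain sqlite3 connection with pandas, which works as an alternative to SQLAlchemy’s create_engine, and writes a tiny throwaway database to Stata:

```python
import os
import sqlite3
import tempfile
import pandas

# Build a tiny example database (in real use this is edgar_idx.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idx (cik TEXT, type TEXT, path TEXT)")
conn.execute("INSERT INTO idx VALUES ('1000045', '10-Q', "
             "'edgar/data/1000045/0001193125-15-368449.txt')")
conn.commit()

# Pull the table into a DataFrame and export it as a .dta file.
data = pandas.read_sql_query("SELECT * FROM idx", conn)
out = os.path.join(tempfile.mkdtemp(), "edgar_idx.dta")
data.to_stata(out, write_index=False)

# Read it back to confirm the round trip.
check = pandas.read_stata(out)
print(check.shape)
```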

I found two articles that explain how to use R and Perl to achieve the same functionality. I include the links (R or Perl) for users who are more comfortable with R or Perl.

This entry was posted in Data, Python.

63 Responses to Use Python to download TXT-format SEC filings on EDGAR (Part I)

1. Alexandra says:

Hi Kai, I have found your page while checking for ways to download SEC EDGAR files. I am very new to this topic and was wondering if you could help me with some questions I have regarding the use of Python for such a massive download? I would be happy to send you an email, but please let me know if you prefer me to ask you as a comment related to your post instead. Thanks!

2. Cliff says:

Thanks for the data, but I have a hard time extracting this file (around 395.95MB). The unzipped file size seems to be about 1.7GB, rather than 19GB. And I also get an error when the extracting process is close to 100%. Any thoughts?

P.S. I use 7-Zip to extract it.

• Kai Chen says:

Hi Cliff, the unzipped SAS data file is 19G on my laptop. I put a link to the mega file so that you can download it directly: https://www.dropbox.com/s/chevzd830h4vtdz/edgaridx.sas7bdat?dl=0

As to the unzipping stopping at nearly 100%, maybe you can use other unzip programs such as WinRAR for better compatibility.

• Harvey Wang says:

Dear Kai,
I don’t know how to handle data of this size. I am new to Python. Any suggestions?

• Harvey Wang says:

Now I use a Python SAS-reading library to solve it. Anyway, thanks!

3. Cliff says:

Hi Kai,

Thank you! I just saw the reply. (I thought I would receive an email from you when I got a reply :p)
I found another website that also includes the EDGAR master file (http://www.wrds.us/index.php/repository/view/25).
When I compare that one with yours, the two are basically quite similar, with almost the same number of observations. One thing I notice is that yours has an ID column and your variables have length 255; therefore their file is 1.9 GB while yours is 19 GB. Good job anyway!

4. Kal Alsabah says:

Hi Kai,

Thanks for sharing this post. I’m facing an error when I run it with Python 2.7 (using Windows):

AttributeError: 'module' object has no attribute 'FTP_ADDR'

I tried importing edgar.FTP_ADDR but I get an import error:

ImportError: No module named FTP

Any idea how to resolve it?

Best,

Kal

• Kai Chen says:

Did you install the package “python-edgar” properly?

• Kal Alsabah says:

Yes, the installation was successful. When I re-install it I get:

5. Scott says:

This is great, but when I run the code I get “sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.”

Is there some way to convert this to utf-8?
Thanks,

6. svet says:

Are 13F filings included in this file? Thanks!

7. grateful guy says:

Wow, this was really helpful. It took me a while to figure out
[Errno 2] No such file or directory

I had to make sure that the directory existed.
http://stackoverflow.com/questions/2401628/open-file-in-w-mode-ioerror-errno-2-no-such-file-or-directory

8. John White says:

Excellent work. We are a boutique investment bank based in Chicago and Dallas, specializing in marketing private offerings publicly under JOBS Act/SEC guidelines. http://www.cassonmediagroup.com

One of the tools we are working on provides free, user-friendly access to all private offerings filed (Form D), plus all offerings that have been posted on the “Equity Crowdfunding” sites (that have not yet filed): a manual aggregation project.

Our thinking at this point is not only to extract (via FTP, etc.) all the new Form D filings every day plus attached docs, but also to build a “web widget” app that can be embedded on any website (similar to stock quote widgets), giving the user a fast and painless way to access and search private offerings.

Also, we have a mobile app (iPad, iPhone, iOS, Android) that would access the data.
At no point do we plan to monetize access; all would be open source and free. We would only ask for a name and email to use the app.

We would deeply appreciate any assistance that you could provide. Paid consulting gig, a 30-minute phone conference, design review, anything!

I look forward to hearing from you.

John White
847-867-5911

9. Tayyaba says:

Hi Kai Chen,
Great post. I was wondering where I can find Part II of this post, as I only want 10-K and 10-Q forms. Secondly, I am able to create the ‘idx’ table, but somehow the values are not getting inserted. Is it because I always end up with an error?

Tayyaba

• Kai Chen says:

Hi Tayyaba, I have not started Part II but will do it soon. If you could send me all the log information, I may be able to find out why the values are not being inserted.

• Tayyaba says:

Hi Kai, thanks for the reply. I have a MySQL DB and I think the format of “records” isn’t the one MySQL expects. So I downloaded your .dta file, converted it into CSV format, and will import the CSV files into the table. But there’s one issue I’m having right now: the URL or path we have in the table points to the .txt format, whereas I want the .htm format. I’m working on it.

This question on stackoverflow is the one which addresses my problem:
http://stackoverflow.com/questions/33979261/html-rendering-of-edgar-txt-filings

• Kai Chen says:

Hi Tayyaba,

Part II is out: http://www.kaikaichen.com/?p=681

I read your post on Stack Overflow. In my code I retrieve “master.idx” because the goal is to download and process raw text filings for further textual analysis. To achieve your goal, you can try retrieving another type of index file, “crawler.idx”. See the example ftp://ftp.sec.gov/edgar/daily-index/crawler.20160201.idx. That path leads to the HTML version. You still need some tweaking (going one level deeper), but that should be doable.

Hope this helps.

Kai

• Tayyaba says:

Hey Kai, Thank you so much, really appreciate it.

10. Eva Lee says:

hi Kai,

Great job. May I know why there is a loop over i from 0 to 9 in the following code?
with zipfile.ZipFile(temp).open('master.idx') as z:
for i in range(10):

Thank you!

• Kai Chen says:

Hi Eva, if you look into the index file ‘master.idx’, you’ll notice the first ten lines are titles, column names, etc. The loop is used to skip the first ten lines and reach the “true” records.
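A more robust variant (a sketch; the helper name data_lines is made up) skips everything up to the dashed separator line instead of assuming exactly ten preamble lines. The sample below is a shortened master.idx:

```python
# master.idx starts with a preamble (title, column names) ended by a
# row of dashes; skipping to that separator avoids hard-coding ten
# header lines. The sample is shortened to one record.
sample = """Description: Master Index of EDGAR Dissemination Feed
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|NICHOLAS FINANCIAL INC|10-Q|2015-11-09|edgar/data/1000045/0001193125-15-368449.txt"""

def data_lines(text):
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("----"):
            return lines[i + 1:]  # everything after the separator
    return lines  # no separator found; return everything

print(data_lines(sample))
```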

• Eva says:

Thank you very much!

11. Kwan says:

Hello Kai,

Thanks so much for your posting. It helps me a lot. I was thinking about doing the same thing but do not have enough skills to write a Python script like this. I appreciate your effort.
I ran the script on Python 3.5.2 and was able to create a SQLite db file. But I get errors when I try to export it as a .dta file. So I just downloaded the .dta file as well, but I want to know what went wrong. Below is the error message.

---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
963 try:
--> 964 l = self.process_rows(self._fetchall_impl())
965 self._soft_close()

C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in _fetchall_impl(self)
914 try:
--> 915 return self.cursor.fetchall()
916 except AttributeError:

OperationalError:

The above exception was the direct cause of the following exception:

OperationalError Traceback (most recent call last)
in ()
5 engine = create_engine('sqlite:///edgar_idx.db')
6 with engine.connect() as conn, conn.begin():
----> 7 data = pandas.read_sql_table('idx', conn)
8 data.to_stata('edgar_idx.dta')

C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
363 table_name, index_col=index_col, coerce_float=coerce_float,
--> 364 parse_dates=parse_dates, columns=columns, chunksize=chunksize)
365
366 if table is not None:

C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read_table(self, table_name, index_col, coerce_float, parse_dates, columns, schema, chunksize)
1129 parse_dates=parse_dates, columns=columns,
-> 1130 chunksize=chunksize)
1131
1132 @staticmethod

C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read(self, coerce_float, parse_dates, columns, chunksize)
809 parse_dates=parse_dates)
810 else:
--> 811 data = result.fetchall()
812 self.frame = DataFrame.from_records(
813 data, columns=column_names, coerce_float=coerce_float)

C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
968 self.connection._handle_dbapi_exception(
969 e, None, None,
--> 970 self.cursor, self.context)
971
972 def fetchmany(self, size=None):

C:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1339 util.raise_from_cause(
1340 sqlalchemy_exception,
-> 1341 exc_info
1342 )
1343 else:

C:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in raise_from_cause(exception, exc_info)
200 exc_type, exc_value, exc_tb = exc_info
201 cause = exc_value if exc_value is not exception else None
--> 202 reraise(type(exception), exception, tb=exc_tb, cause=cause)
203
204 if py3k:

C:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in reraise(tp, value, tb, cause)
183 value.__cause__ = cause
184 if value.__traceback__ is not tb:
--> 185 raise value.with_traceback(tb)
186 raise value
187

C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
962
963 try:
--> 964 l = self.process_rows(self._fetchall_impl())
965 self._soft_close()
966 return l

C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in _fetchall_impl(self)
913 def _fetchall_impl(self):
914 try:
--> 915 return self.cursor.fetchall()
916 except AttributeError:
917 return self._non_result([])

OperationalError: (sqlite3.OperationalError)

• Kai Chen says:

Hi Kwan,

I don’t know the exact cause based on the error log, but I guess it may be related to relative vs. absolute paths. Try this: first, check the file size of the “edgar_idx.db” you got; if it is a sizeable file, the problem lies in the conversion from SQLite to Stata. Second, use the absolute path in the conversion step: engine = create_engine(r'sqlite:///C:\path\to\edgar_idx.db'). Replace 'C:\path\to\' with your own absolute path.

Let me know if this works.

Kai

• Kwan says:

Hello Kai,

I spent some time reverse engineering and revising your script to suit my need. And your suggestion worked!! Thanks so much!!

• Kwan says:

I was so excited when Python was running, but unfortunately, at the end, it gave an OperationalError. Would it be due to the file size, the first possibility you mentioned above? Could you explain how to deal with big db files?

• Kai Chen says:

I didn’t try my code on a Windows machine. The first step is to make sure you have generated the SQLite database correctly. If the database has gathered the data as it is supposed to, its size should be about 2G. This first step helps you decide at which stage the error occurs. If you can get the correct database file but just cannot export the data to another software package, you can Google other exporting methods (there are tons of them).

12. Kwan says:

Thanks for your reply, Kai. I have encountered errors a few times before when running scripts written on a Mac. It is more than a backslash or forward slash issue, but I could not pinpoint the cause. I will figure it out and get back to you when I solve the issue. Thanks!

13. agosta says:

Thank you Kai for sharing this tutorial – after tweaking your code to better suit my environment it worked on the first try! I do, however, see something odd about the data returned. Although my sqlite db is about 2gb as you said it should be (it’s actually 1.7gb with another table in it), I find that I was only able to index ~16.4M filings when I ran the script today.

However, this article written back in 2014 (http://tinyurl.com/jupr3zy), indicates that there should’ve been at least 25M filings. I’ve seen various other articles that note there should be at least 20M.

Do you have any idea where the other ~5M to ~10M indexes are?

Thanks again for sharing!

• Kai Chen says:

You’re welcome. I have no idea about the other 5–10M indexes, but I believe the code is able to grab all indexes currently populated on SEC’s ftp server.

14. erm3nda says:

For people using this: the FTP server will shut down in December 2016. It is published on their site. https://www.sec.gov/edgar/searchedgar/ftpusers.htm

15. David says:

Does this return all the 10-Ks for every company in the Russell 3000?

I would like to know what it extracts. I have done database programming using SQLite to extract tweets into a database. I would like to know what this does specifically.

16. David says:

So I get the name of the company, ticker and path. Then how can I use the path to iterate over the text file?

17. Manuel says:

Unfortunately, EDGAR FTP access will be shut down by the end of the year 🙁
https://www.sec.gov/edgar/searchedgar/ftpusers.htm

18. HDT says:

Regarding the SEC shutting down the FTP server on 12/31:

Replacing the FTP bits in the script with urllib2 and using https://www.sec.gov/Archives/edgar/full-index etc. as the base will return exactly the same data, and the rest of the script should work accordingly.
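In Python 3 terms (urllib2 became urllib.request), the replacement might look like the following sketch. The function names are illustrative, nothing here is the original script, and note that SEC may nowadays also expect a User-Agent header on automated requests:

```python
from urllib.request import urlopen

BASE = "https://www.sec.gov/Archives/edgar/full-index"

def master_idx_url(year, quarter):
    # Same layout as the old FTP tree, served over HTTPS.
    return "{}/{}/QTR{}/master.idx".format(BASE, year, quarter)

def fetch_master_idx(year, quarter):
    # master.idx is plain text; latin-1 is a safe decoding for the
    # occasional non-ASCII company name.
    with urlopen(master_idx_url(year, quarter)) as resp:
        return resp.read().decode("latin-1")

print(master_idx_url(2015, 4))
```

The rest of the script (parsing lines, inserting into SQLite) would then consume the returned text exactly as it consumed the FTP download.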

• LzG says:

HDT, can you elaborate more? I’m in the process of refactoring for urllib2 and am having difficulties with method statements.

19. Tim says:

Hi HDT,

I would also appreciate it if you could reply. Specifically if you could expand on what exactly entails replacing the FTP bits in the script with urllib2.

Thank you!

20. Xiang says:

Thanks for your code! I am using it to successfully download proxy statements in txt format. Since the htm files are better formatted, I tried to revise your code to download crawler.idx. But I am stuck at this line:
cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
as I don’t know how to set line.split for crawler.idx. Can you help revise the code? Thanks in advance!

• Kai Chen says:

Please substitute the “records = ” line with the following several lines:

nameloc = lines[7].find('Company Name')
typeloc = lines[7].find('Form Type')
cikloc = lines[7].find('CIK')
dateloc = lines[7].find('Date Filed')
urlloc = lines[7].find('URL')
records = [tuple([line[nameloc:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(), line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]

But note the URL you obtain is not the URL to an HTML-version form (see one example of such a URL: https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm). You still have to go one level deeper, which I don’t know how to do yet.

21. Yulin Chen says:

Hi Kai

That is really useful code that can be reused in many scenarios. I am trying to create a database that contains detailed information from 13F-HR filings. I did it by getting all the raw txt/xml data based on your code 🙂

Now something that frustrates me (a little bit) is that there are only raw txt files for 13Fs until 2013 Q3, when XML became available.

Although it is quite easy to parse XML with Python’s ElementTree, I am still struggling with how to deal with the txt data from before 2013 Q3.

Any constructive thoughts?

Thank you very much!

By the way, I am also in Ontario, CA 🙂

• Kai Chen says:

My current knowledge of text data processing is also limited. To start with, you can learn regular expressions; that could resolve many questions. Higher-level applications need knowledge of natural language processing using Python (or other programming languages). It’s technically intimidating.

22. Irem says:

Hi Kai, thanks for your post. I am just wondering whether you know if it is possible to automate an online download with Stata. This is the only program I am comfortable using, but I have never written such code before.

• Kai Chen says:

I don’t know the exact answer, but because Stata is not a general-purpose programming language, it would be awkward to use Stata for this.

23. Jing says:

Hi Kai, this is very useful code. But I encountered the errors below when executing your exact code on both of my computers. Do you know the reason, and how I can solve it? I need to analyze the 10-Ks to finish my homework. I appreciate your thoughts. Thank you very much!
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
in ()
38 with engine.connect() as conn, conn.begin():
---> 40 data.to_stata('edgar_idx.dta')

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label, variable_labels)
1487 write_index=write_index,
1488 variable_labels=variable_labels)
-> 1489 writer.write_file()
1490
1491 def to_feather(self, fname):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\stata.py in write_file(self)
2137 # write 5 zeros for expansion fields
-> 2139 self._prepare_data()
2140 self._write_data()
2141 self._write_value_labels()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\stata.py in _prepare_data(self)
2272
2273 if has_strings:
-> 2274 self.data = np.fromiter(zip(*data_cols), dtype=dtype)
2275 else:
2276 self.data = data.to_records(index=False)

MemoryError: cannot allocate array memory

• Kai Chen says:

Hi Jing, I am not sure where the problem lies. Maybe you didn’t install the required modules correctly, or maybe your machines didn’t have enough free RAM. Try changing start_year to a very recent year, say 2017, and see if the code can go through. If yes, then RAM is probably the culprit.

24. CJ Hu says:

Hi Chen,

Thanks so much for your tutorial and sample code. I really appreciate it. Your code works perfectly fine, except that Python (v3.5, using PyCharm) hangs when trying to download the 2011QTR4 and 2017QTR3 master.idx. I managed to download all other master.idx files by changing the year/quarter variables. I have tried on another computer and the same happened. Any idea what might be the problem? I can work with the Stata file you uploaded for now, but I was thinking of extending the data in the future.

Thanks.

• Kai Chen says:

Hi Chang Jie. It just takes longer (e.g., around 11 minutes to download 2011QTR4 on my laptop), but it will go through eventually. I don’t know why, though.

• Yannick B says:

It is not the download that takes time. For some reason the requests library has difficulty auto-detecting the encoding of these two files. It detects 'ISO-8859-1' instead of the standard 'ascii'.

I fixed it as such for the moment:

rep = requests.get(url)
rep.encoding = 'ISO-8859-1'
lines = rep.text.splitlines()

25. Scott Anderson says:

Hi Kai,

Is there a way for this also to include the time the report was submitted in addition to the date?

26. Philip Bastiansen says:

Dear Kai Chen!

First of all, THANK you so much for doing this; you truly are amazing! I have been trying to run your code, but it doesn’t work due to a MemoryError (because of my lack of RAM... I have 8GB). If I change the start date to 2015, it works flawlessly! However, I do need the data back to 1993. I was wondering if there is a way to limit the scope of this code to include only the master-file entries for 10-K form types?

I’d love to hear back from you, since I am currently working on my master’s thesis!

Best regards,

• Kai Chen says:

Philip, you can download the data chunk by chunk: for example, first download data over 1993-1996, then download data over 1997-2000, and so on. All you need to do is change the first three lines: set the current quarter equal to 4; for data 1993-1996, set the start year equal to 1993 and the current year equal to 1997; then for data 1997-2000, set the start year equal to 1997 and the current year to 2000; and so on. Lastly, combine all the data chunks.
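As a sketch, this chunk-by-chunk approach can be expressed with a small helper (the name year_chunks is made up) that yields inclusive year ranges to feed into the start/current-year variables:

```python
def year_chunks(first, last, size):
    # Split [first, last] into inclusive ranges of at most `size` years,
    # e.g. 1993-1996, 1997-2000, ... so each run fits in memory.
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield (start, end)
        start = end + 1

print(list(year_chunks(1993, 2000, 4)))
```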

• Philip Pagter Bastiansen says:

Thank you so much, that solved it for me! 🙂

27. After checking out a number of the blog articles on your website, I really like your way of blogging.

I saved it to my bookmark website list and will be checking
back soon. Take a look at my website as well and tell me how you feel.

28. Dominik says:

Many thanks Kai Chen,

great code for accessing the EDGAR database.

I only have a problem with Part I of your code, as it hits an error in pandas before writing to the Stata .dta file. Can you help me with that? Many thanks!

~ line 42, in
parse_dates=parse_dates, columns=columns, chunksize=chunksize)
chunksize=chunksize)
data = result.fetchall()
~sqlalchemy\engine\result.py", line 1137, in fetchall
self.cursor, self.context)
~sqlalchemy\engine\base.py", line 1416, in _handle_dbapi_exception
util.reraise(*exc_info)
~sqlalchemy\util\compat.py", line 249, in reraise
raise value
~sqlalchemy\engine\result.py", line 1131, in fetchall
l = self.process_rows(self._fetchall_impl())
~sqlalchemy\engine\result.py", line 1111, in process_rows
for row in rows]
~sqlalchemy\engine\result.py", line 1111, in
for row in rows]
MemoryError

• Kai Chen says:

My best guess is that the latest pandas changed the way it handles memory and, as a result, requires more memory to complete the conversion. I created the Stata dataset without any issue in March 2017, but now I can only create the Stata dataset successfully in smaller pieces. In other words, I had to create multiple Stata datasets: the first for 1993–2000, the second for 2001–2005, the third for 2006–2010, and so on. I’m curious how much memory pandas now requires to convert that single SQLite database to Stata in just one step. My 24 GB iMac failed.
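A sketch of one workaround (illustrative names; a real chunk size would be much larger, e.g. one million rows against the 2 GB database) is to stream the table with pandas’ chunksize argument and write several smaller .dta files instead of one:

```python
import os
import sqlite3
import tempfile
import pandas

# Build a tiny stand-in for edgar_idx.db (5 rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idx (cik TEXT, path TEXT)")
conn.executemany("INSERT INTO idx VALUES (?, ?)",
                 [(str(i), "edgar/data/%d.txt" % i) for i in range(5)])
conn.commit()

# Stream the table in chunks and write one .dta file per chunk,
# so no single DataFrame ever holds the whole table.
outdir = tempfile.mkdtemp()
files = []
for n, chunk in enumerate(
        pandas.read_sql_query("SELECT * FROM idx", conn, chunksize=2)):
    path = os.path.join(outdir, "edgar_idx_%d.dta" % n)
    chunk.to_stata(path, write_index=False)
    files.append(path)

print(len(files))  # 5 rows in chunks of 2 -> 3 files
```

The resulting pieces can then be appended in Stata.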

29. Dominik says:

Hi Kai Chen,

many thanks for your helpful post. Unfortunately, your first code does not run through on my machine.

It would be great if you could upload a new version of the Stata dataset for download.

Many thanks!

30. Dom says:

Hi Kai Chen,

many thanks for your great post!

I was wondering whether the “Series ID”, which has to be published by Investment Companies, can also be downloaded via the Index file? Do you have any idea for the matching?

https://www.sec.gov/open/datasets-investment_company.html

Best regards and many thanks,
Dom

31. Caleb Wang says:

Hi Kai,