I helped a friend download data from the DTCC's Swap Data Repository. I am not familiar with the data myself, so I treated this as a programming exercise.
This DTCC article introduces the origin of the data: http://www.dtcc.com/news/2013/january/03/swap-data-repository-real-time
The Python script will:
- download the daily Credit zip files; and
- extract the CSV from each zip file and combine the contents into a single large CSV (~220 MB), which can then be imported into Stata or another statistical package.
As of April 22, 2016, there were around one million historical records. The data appears to be available from April 6, 2013, with sporadic gaps thereafter. The script prints the bad dates for which no daily file is available.
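The daily slice files embed the date in their names with underscores rather than ISO hyphens, which is why the script rewrites the ISO date string. A minimal illustration, using the first available date mentioned above:

```python
from datetime import date

# DTCC slice names use YYYY_MM_DD rather than ISO's YYYY-MM-DD.
datestr = date(2013, 4, 6).isoformat().replace('-', '_')
print(datestr)                                    # 2013_04_06
print('CUMULATIVE_CREDITS_' + datestr + '.zip')   # CUMULATIVE_CREDITS_2013_04_06.zip
```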
```python
import io
import zipfile
from datetime import date

import pandas as pd
import requests

start = date(2013, 1, 1)
end = date.today()

# Build (url, filename) pairs for every day in the range.
urls = []
for i in range(start.toordinal(), end.toordinal()):
    datestr = date.fromordinal(i).isoformat().replace('-', '_')
    url = ('https://kgc0418-tdw-data2-0.s3.amazonaws.com/slices/'
           'CUMULATIVE_CREDITS_' + datestr + '.zip',
           'CUMULATIVE_CREDITS_' + datestr + '.zip')
    urls.append(url)

badurls = []
frames = []
for url in urls:
    request = requests.get(url[0])
    # Days with no data return a non-zip payload; record them and move on.
    if not zipfile.is_zipfile(io.BytesIO(request.content)):
        print(url[1], 'is non-existent!')
        badurls.append(url)
    else:
        with open(url[1], 'wb') as f:
            f.write(request.content)
        print(url[1], 'downloaded!')
        z = zipfile.ZipFile(io.BytesIO(request.content))
        df_ = pd.read_csv(z.open(z.namelist()[0]))
        df_['DATE'] = url[1][19:29]  # the YYYY_MM_DD part of the file name
        frames.append(df_)

# Concatenate once at the end (DataFrame.append was removed in pandas 2.0,
# and repeated appends inside the loop are quadratic anyway).
df = pd.concat(frames, ignore_index=True)
df.to_csv('dtcc.csv')
print(badurls)
```
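A 220 MB CSV may be too large to load comfortably in one go. One option (a sketch, not part of the original script) is pandas' `chunksize` argument, which returns an iterator of smaller DataFrames; the `demo.csv` file below is a hypothetical stand-in for `dtcc.csv`:

```python
import pandas as pd

# Stand-in for the large dtcc.csv: a small throwaway file.
pd.DataFrame({'PRICE': range(10)}).to_csv('demo.csv', index=False)

# Read it back 4 rows at a time instead of all at once.
chunks = pd.read_csv('demo.csv', chunksize=4)
total = sum(len(chunk) for chunk in chunks)
print(total)  # 10
```

The same pattern lets you filter or aggregate each chunk before combining, so the full file never has to sit in memory.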