[Update on 2022-01-08] I am grateful to my research assistant, Shiyu Chen, for her excellent work in this update.
The website now requires a login, so a function was added to log in and access the protected content. The code may still terminate partway through (e.g., every 80 pages or so) due to a timeout or connection error, so you may need to run it several times, changing the starting page number (the pn variable) accordingly. Please replace the login email and password with your own (the email and password variables). CSV files including 6,122 cases as of 2022-01-08 are provided for easy download (Securities Class Action Filings 2022-01-08 p1 to p84, Securities Class Action Filings 2022-01-08 p85 to p161, and Securities Class Action Filings 2022-01-08 p162 to p205).
""" ------------------------------------------- Author: Shiyu Chen __updated__="2022-01-03" ------------------------------------------- """ from selenium import webdriver from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.by import By from time import sleep from bs4 import BeautifulSoup import re import requests import csv # ------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ def get_html(url): driver = webdriver.Chrome(service=s) driver.get(url) driver.find_element(By.XPATH, '/html/body/header/div[1]/div/div[2]/ul/li[6]/a[2]/strong').click() sleep(3) driver.find_element(By.XPATH, '/html/body/div[4]/form/div[1]/div[1]/input').send_keys(email) driver.find_element(By.XPATH, '/html/body/div[4]/form/div[1]/div[2]/input').send_keys(password) driver.find_element(By.XPATH, '/html/body/div[4]/form/div[2]/button[3]').click() sleep(3) html = driver.page_source driver.quit() return html def get_total_page_number(): req = requests.get("http://securities.stanford.edu/filings.html") html = req.text soup = BeautifulSoup(html, 'html.parser') page = str(soup.find("div", class_="span6")) pattern = r"\((?P<number>\d+)\)" object = re.search(pattern, page) num = int(object.group("number")) # print(num) page_number = num // 30 + 1 return page_number def get_all_lines_in_one_page(pn): url = url_ + str(pn) # req = requests.get(url) # html = req.text html = get_html(url) soup = BeautifulSoup(html, 'html.parser') # <tr class="table-link" page="filings" onclick="window.location='filings-case.html?id=107058'"> all_line = soup.find_all("tr", class_="table-link", page="filings") return all_line def get_basic(one): pattern_link = r"id=(?P<id>\d*)" id = (re.search(pattern_link, str(one))).group("id") link_fix = "http://securities.stanford.edu/filings-case.html?id=" link = link_fix + id # title------------------ info = one.find_all("td", class_="") name = info[0].get_text(strip=True) date = info[1].get_text(strip=True) court = info[2].get_text(strip=True) exchange = info[3].get_text(strip=True) ticker = info[4].get_text(strip=True) return name, date, court, exchange, ticker, link def get_summary(link): req = requests.get(link) html = req.text soup = BeautifulSoup(html, 'html.parser') # <div class="span12" style="background-color: #ffffff;"> summary = str(soup.find("div", class_="span12", style="background-color: #ffffff;")) summary = re.sub(r"</?(.+?)>", "", summary) summary = re.sub(r"\s+", " ", summary) return summary, soup def get_other(soup): section = soup.find("section", id="summary") s_u = section.find("p").get_text().strip() p_status = r"Case Status:(\W+)(?P<status>\w+)(\W*)On" status = re.search(p_status, s_u) if status is None: status = "" else: status = status.group("status") # update data------------------------------------ p_update = r"On or around .+\)" update_date = re.search(p_update, s_u) if update_date is None: update_date = "" else: update_date = update_date.group() # ------------------------------- filing_date = section.find("p", class_="lead").get_text() filing_date = re.search(r"Filing Date: (?P<filing_date>.+)", filing_date).group("filing_date") return status, update_date, filing_date # ------------------------------------------------------------------------------ # 
------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ email = 'your login email' password = 'your password' s = Service(ChromeDriverManager().install()) url_ = "http://securities.stanford.edu/filings.html?page=" pn = 1 with open("Securities Class Action Filings 2022-01-08.csv", "w") as csvfile: writer = csv.writer(csvfile) writer.writerow( ["Filling Name", "Filing Date", "District Court", "Exchange", "Ticker", "Link", "Case Status", "Update Date", "Filing Date", "Summary"]) while pn <= get_total_page_number(): all_line = get_all_lines_in_one_page(pn) for oneline in all_line: name, date, court, exchange, ticker, link = get_basic(oneline) summary, soup = get_summary(link) status, update_date, filing_date = get_other(soup) # write-------- one = [name, date, court, exchange, ticker, link, status, update_date, filing_date, summary] writer.writerow(one) print(pn) pn += 1 print("Finish") |
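As noted above, the scraper can stop partway through and has to be rerun from a later page. A minimal sketch of one way to automate that, assuming the functions, writer, pn, and sleep from the listing above are in scope and that the CSV is opened in append mode ("a") on reruns, is to checkpoint the last completed page and retry failed pages. The last_page.txt checkpoint file is my own illustrative addition, not part of the original script, and rows from a page that failed midway may be duplicated when that page is redone.

import os

CHECKPOINT = "last_page.txt"  # hypothetical file recording the last fully scraped page

# Resume from the checkpoint if a previous run was interrupted
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        pn = int(f.read().strip()) + 1

total_pages = get_total_page_number()
while pn <= total_pages:
    try:
        all_line = get_all_lines_in_one_page(pn)
        for oneline in all_line:
            name, date, court, exchange, ticker, link = get_basic(oneline)
            summary, soup = get_summary(link)
            status, update_date, filing_date = get_other(soup)
            writer.writerow([name, date, court, exchange, ticker, link,
                             status, update_date, filing_date, summary])
        # Only record the page after every row on it has been written;
        # if the run dies mid-page, that page is redone on the next attempt
        with open(CHECKPOINT, "w") as f:
            f.write(str(pn))
        print(pn)
        pn += 1
    except Exception as e:
        print("Page", pn, "failed:", e)
        sleep(10)  # brief pause before retrying the same page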
[Update on 2019-07-07] I am grateful to Shiyu Chen, my research assistant, who did an excellent job not only web scraping the top-level table but also extracting additional information from each case summary page (link to the case summary page, case status, update date, case summary, class period start date, and class period end date). I post her Python program below with her permission.
""" ------------------------------------------- [program description] ------------------------------------------- Author: Shiyu Chen __updated__="2019-07-04" ------------------------------------------- """ # Import from bs4 import BeautifulSoup import re import requests import csv # ------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ def get_total_page_number(): req = requests.get("http://securities.stanford.edu/filings.html") html = req.text soup = BeautifulSoup(html, 'html.parser') page = str(soup.find("div", class_="span6")) pattern = r"\((?P<number>\d+)\)" object = re.search(pattern, page) num = int(object.group("number")) # print(num) page_number = num // 20 + 1 return page_number def get_all_lines_in_one_page(pn): url = url_ + str(pn) req = requests.get(url) html = req.text soup = BeautifulSoup(html, 'html.parser') # <tr class="table-link" page="filings" onclick="window.location='filings-case.html?id=107058'"> all_line = soup.find_all("tr", class_="table-link", page="filings") return all_line def get_basic(one): pattern_link = r"id=(?P<id>\d*)" id = (re.search(pattern_link, str(one))).group("id") link_fix = "http://securities.stanford.edu/filings-case.html?id=" link = link_fix + id # title------------------ info = one.find_all("td", class_="") name = info[0].get_text(strip=True) date = info[1].get_text(strip=True) court = info[2].get_text(strip=True) exchange = info[3].get_text(strip=True) ticker = info[4].get_text(strip=True) return name, date, court, exchange, ticker, link def get_summary(link): req = requests.get(link) html = req.text soup = BeautifulSoup(html, 'html.parser') # <div class="span12" style="background-color: #ffffff;"> summary = str(soup.find("div", class_="span12", style="background-color: #ffffff;")) summary = re.sub(r"</?(.+?)>", "", summary) summary = re.sub(r"\s+", " ", summary) return summary, soup def get_other(soup): section = soup.find("section", id="summary") s_u = section.find("p").get_text().strip() p_status = r"Case Status:(\W+)(?P<status>\w+)(\W*)On" status = re.search(p_status, s_u) if status is None: status = "" else: status = status.group("status") # update data------------------------------------ p_update = r"On or around .+\)" update_date = re.search(p_update, s_u) if update_date is None: update_date = "" else: update_date = update_date.group() # ------------------------------- filing_date = section.find("p", class_="lead").get_text() filing_date = re.search(r"Filing Date: (?P<filing_date>.+)", filing_date).group("filing_date") return status, update_date, filing_date def get_class_period(soup): section = soup.find("section", id="fic") text = section.find_all("div", class_="span4") start_date = text[4].get_text() end_date = text[5].get_text() return start_date, end_date # ------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ # ------------------------------------------------------------------------------ url_ = "http://securities.stanford.edu/filings.html?page=" pn = 1 id = 1 with open("sca.csv", "w") as csvfile: writer = csv.writer(csvfile) writer.writerow( ["ID", "Filling Name", "Filing Date", "District Court", "Exchange", "Ticker", "Link", "Case Status", "Update Date", "Filing Date", "Summary", "Class Period Start", "Class Period End"]) while pn <= 
get_total_page_number(): all_line = get_all_lines_in_one_page(pn) for oneline in all_line: name, date, court, exchange, ticker, link = get_basic(oneline) summary, soup = get_summary(link) status, update_date, filing_date = get_other(soup) start_date, end_date = get_class_period(soup) # write-------- one = [id, name, date, court, exchange, ticker, link, status, update_date, filing_date, summary, start_date, end_date] writer.writerow(one) id += 1 print(pn) pn += 1 print("Finish") |
[Original Post] Several papers borrow the litigation risk model supplied in Equation (3) of Kim and Skinner (2012, JAE, "Measuring securities litigation risk"). The logit model uses total assets, sales growth, stock return, stock return skewness, stock return standard deviation, and turnover to estimate a predicted probability of litigation. The measure of litigation risk is used by Billings and Cedergren (2015, JAE), Kerr and Ozel (2015, TAR), Bourveau, Lou, and Wang (2018, JAR), and Baginski, Campbell, Hinson, and Koo (2018, TAR), among others (thanks to Chunmei Zhu for the literature review).
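For reference, the general shape of that specification, written only from the predictor list above (consult Kim and Skinner's paper for the exact variable construction, measurement windows, and coefficient estimates), is a logit of the form

\Pr(\text{Sued}_{it}=1) = \Lambda\big(\beta_0 + \beta_1 \ln(\text{Assets}_{it}) + \beta_2\,\text{SalesGrowth}_{it} + \beta_3\,\text{Return}_{it} + \beta_4\,\text{ReturnSkewness}_{it} + \beta_5\,\text{ReturnStdDev}_{it} + \beta_6\,\text{Turnover}_{it}\big),

where \Lambda(\cdot) is the logistic function; the fitted probability is the litigation risk measure carried into the papers listed above.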
The model uses lawsuit data obtained from Stanford Law School's Securities Class Action Clearinghouse. However, the website does not deliver the data in a downloadable format, so I wrote a Python program to extract the data from the website (a technique called web scraping).
I use Python 3.x; please install all required modules first (e.g., pip install beautifulsoup4 for this script; the updates above also need requests, selenium, and webdriver-manager). I provide the data (as of 2019-07-07) in a CSV file for easy download (sca.csv).
from urllib import request
from bs4 import BeautifulSoup
import re
from math import ceil
import csv

# Determine the number of pages to webscrape
scac = "http://securities.stanford.edu/filings.html"
page = request.urlopen(scac)
soup = BeautifulSoup(page, 'html.parser')
heading = soup.find_all('h4')[-1].get_text()
total_record_num = re.findall(r'\d+', heading)[0]
total_page_num = ceil(int(total_record_num) / 20)

# Webscrape all pages
container = [("filing_name", "filing_date", "district_court", "exchange", "ticker")]
i = 1
while i <= total_page_num:
    url = scac + "?page=" + repr(i)
    print(url)
    page = request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.find('table', class_='table table-bordered table-striped table-hover')
    tbody = table.find('tbody')
    for row in tbody.find_all('tr'):
        columns = row.find_all('td')
        c1 = re.sub(r'[\t\n]', '', columns[0].get_text()).strip()
        c2 = re.sub(r'[\t\n]', '', columns[1].get_text()).strip()
        c3 = re.sub(r'[\t\n]', '', columns[2].get_text()).strip()
        c4 = re.sub(r'[\t\n]', '', columns[3].get_text()).strip()
        c5 = re.sub(r'[\t\n]', '', columns[4].get_text()).strip()
        container.append((c1, c2, c3, c4, c5))
    i = i + 1

# Write to a CSV file
with open('scac.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(container)
Awesome! Thanks for sharing!
I was about to go through building a scraper for this from scratch… you saved me so much time! This is great!
Hi Dr. Chen,
Thanks so much for this code. I got stuck using it because the Securities Class Action Clearinghouse now requires a login to get the full data. I tried the "mechanize" package to log in, but it doesn't work. Do you have any ideas about how to get access to the website?
Hi, do you have a solution to this problem?
Kind regards,
Yannick
See the update on Jan 8, 2022.
I added error handling in the get_class_period function to avoid the issue when the case's status is currently Active.
def get_class_period(soup):
    section = soup.find("section", id="fic")
    try:
        text = section.find_all("div", class_="span4")
        start_date = text[4].get_text()
        end_date = text[5].get_text()
    except:
        start_date = 'null'
        end_date = 'null'
    return start_date, end_date
Thanks for the correction. But this only solves the error issue; it does not return the class period for any lawsuits. Any idea how I can get the class period and access the lawsuit files? The HTML does not even show the content beyond the case summary.
How do you parse the settlement value?
Hi Dr. Chen and Shiyu! Thank you so much for sharing this! I appreciate it!
Thank you so much, Dr. Chen!
Just a small note: you need to set Chrome to open maximized by default, or add this right after the driver is created in get_html():
driver.maximize_window()
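For concreteness, that placement would look roughly like this near the top of get_html() (only the first few lines of the function are shown):

def get_html(url):
    driver = webdriver.Chrome(service=s)
    driver.maximize_window()  # make sure the login link is visible before clicking it
    driver.get(url)
    ...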
Thanks for sharing the files and codes, very useful!
Dr. Chen,
This is awesome. Thank you for your generous sharing!
Hi Kai,
Thanks again for making your code available.
I am also trying to scrape the legal documents/PDFs.
This code works to download PDF files from other URLs, but not the PDFs from the Stanford Securities Class Action Clearinghouse (the only difference I can see is that a login is required, but I am already logged in by the time we reach this code):
import requests

# file_url = "https://www.bu.edu/econ/files/2014/08/DLS1.pdf"
file_url1 = 'http://securities.stanford.edu/filings-documents/1080/IBMC00108070/2023113_f01c_23CV00332.pdf'
r = requests.get(file_url1, stream=True)
with open("C:/Users/inter/OneDrive/Desktop/securities_class_action_docs/test.pdf", "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)
I meant to ask you: how do you think I can download the PDFs into Google Drive as PDFs and not HTML files?
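A note on the PDF question above: requests does not share the login session established in the Selenium browser, so the clearinghouse serves its login page (HTML) instead of the PDF. One common workaround, sketched below and not tested against this site, is to copy the cookies from the logged-in browser into a requests.Session and download with that session. The driver here is assumed to be a Selenium driver that has already logged in (as in get_html above, but kept open rather than quit), and the document URL is the one from the comment above.

import requests

# Assumes `driver` is a Selenium WebDriver already logged in to securities.stanford.edu
session = requests.Session()
for cookie in driver.get_cookies():
    # Copy each browser cookie into the requests session
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie.get('domain'))

file_url1 = 'http://securities.stanford.edu/filings-documents/1080/IBMC00108070/2023113_f01c_23CV00332.pdf'
r = session.get(file_url1, stream=True)

with open("test.pdf", "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)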