The Python script in the original post has been removed as its use violates the Terms of Service of the data provider.
Stanford Law School’s Securities Class Action Clearinghouse is always happy to share the data (subject to a Non-Disclosure Agreement) with academic researchers for non-commercial research or analysis. If you have any data needs, please contact their SCAC Content Manager at scac@law.stanford.edu.
Awesome! Thanks for sharing!
I was about to build a scraper for this from scratch… you saved me so much time! This is great!
Hi Dr. Chen,
Thanks so much for this code. I got stuck using it because the Securities Class Action Clearinghouse requires a login to get the full data. I tried the "mechanize" package to log in, but it doesn't work. Do you have any ideas about how to get access to the website?
Hi, do you have a solution to this problem?
Kind regards,
Yannick
See the update on Jan 8, 2022.
Added error handling in the get_class_period method to avoid the error when a case's status is currently Active.
def get_class_period(soup):
    section = soup.find("section", id="fic")
    try:
        text = section.find_all("div", class_="span4")
        start_date = text[4].get_text()
        end_date = text[5].get_text()
    except (AttributeError, IndexError):  # section or date cells missing, e.g. for Active cases
        start_date = 'null'
        end_date = 'null'
    return start_date, end_date
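For anyone testing this locally, here is a minimal sketch that exercises the same parsing logic against a made-up HTML fragment. The `id="fic"` section and the `span4` cell positions follow the snippet above; the real Clearinghouse markup may differ, and the field labels and dates below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the "fic" section layout assumed above;
# cells 4 and 5 are taken to hold the class-period start and end dates.
html = """
<section id="fic">
  <div class="span4">Court</div><div class="span4">N.D. Cal.</div>
  <div class="span4">Exchange</div><div class="span4">NYSE</div>
  <div class="span4">01/02/2020</div>
  <div class="span4">06/30/2020</div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", id="fic")
cells = section.find_all("div", class_="span4")
print(cells[4].get_text(), cells[5].get_text())  # 01/02/2020 06/30/2020
```

For an Active case the date cells are absent, `cells[4]` raises `IndexError`, and the `except` branch above returns the 'null' placeholders instead of crashing.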
Thanks for the correction. But this only fixes the error; it does not return the class period for any lawsuit. Any idea how I can get the class period and access the lawsuit files? The HTML does not even show the contents beyond the case summary.
How do you parse the settlement value?
Hi Dr. Chen and Shiyu! Thank you so much for sharing this! I appreciate it!
Thank you so much, Dr. Chen!
Just a small note: you need to set Chrome to start maximized by default, or add this before line 18:
driver.maximize_window()
Thanks for sharing the files and codes, very useful!
Dr. Chen,
This is awesome. Thank you for your generous sharing!
Hi Kai,
Thanks again for making your code available.
I am also trying to scrape the legal documents/PDFs.
This code works to download PDF files from other URLs, but not the PDFs from the Stanford Securities Class Action Clearinghouse (the only difference I can see is that a login is required, but I am already logged in by the time we reach this code):
import requests

# file_url = "https://www.bu.edu/econ/files/2014/08/DLS1.pdf"
file_url1 = 'http://securities.stanford.edu/filings-documents/1080/IBMC00108070/2023113_f01c_23CV00332.pdf'
r = requests.get(file_url1, stream=True)
with open("C:/Users/inter/OneDrive/Desktop/securities_class_action_docs/test.pdf", "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)
I meant to ask you: how do you think I can download the PDFs into Google Drive as PDFs and not as HTML files?
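One likely reason the Clearinghouse downloads fail while other URLs work: `requests.get` opens a fresh, unauthenticated session, so even though you are logged in through the browser/Selenium, the server answers with an HTML login or redirect page instead of the PDF — which would also explain files arriving as HTML rather than PDF. The sketch below assumes a logged-in Selenium `driver` object exists; the cookie-transfer lines and the PDF check are illustrative, not tested against the actual site.

```python
import requests

def looks_like_pdf(headers, first_bytes):
    """Heuristic check: real PDFs either declare a PDF Content-Type or
    start with the %PDF magic bytes; a login redirect comes back as HTML."""
    ctype = headers.get("Content-Type", "")
    return ctype.startswith("application/pdf") or first_bytes.startswith(b"%PDF")

def download_pdf(session, url, path):
    """Stream url to path, but only if the response actually looks like a PDF."""
    r = session.get(url, stream=True)
    first = next(r.iter_content(chunk_size=1024), b"")
    if not looks_like_pdf(r.headers, first):
        raise RuntimeError("Got a non-PDF response -- probably the login page")
    with open(path, "wb") as f:
        f.write(first)
        for block in r.iter_content(chunk_size=1024):
            f.write(block)

# Usage sketch (requires the logged-in Selenium `driver` from the scraper):
# session = requests.Session()
# for c in driver.get_cookies():              # copy the authenticated cookies
#     session.cookies.set(c["name"], c["value"])
# download_pdf(session,
#              'http://securities.stanford.edu/filings-documents/1080/IBMC00108070/2023113_f01c_23CV00332.pdf',
#              'test.pdf')
```

The same idea answers the Google Drive question: the files saved as "HTML" are almost certainly login pages, so once the authenticated session is reused the saved bytes will be real PDFs, which you can then write to a mounted Drive path.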