Use Python to calculate the tone of financial articles

[Update on 2019-03-01] I have completely rewritten the Python program. The updates include:

  • I include two domain-specific dictionaries, Loughran and McDonald's and Henry's, and you can choose which one to use.
  • I add a negation check, as suggested by Loughran and McDonald (2011): any occurrence of a negate word (e.g., isn't, not, never) within the three words preceding a positive word flips that positive word into a negative one. The negation check applies only to positive words because Loughran and McDonald (2011) suggest that double negation (i.e., a negate word preceding a negative word) is not common. I expand their negate word list, though, since theirs seems incomplete. In my sample of 90,000+ press releases, the negation check finds that 5.7% of press releases contain a positive word with a preceding negate word. A minimal sketch of the check follows this list.
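
Here is a minimal sketch of that check. The word sets below are stand-ins for illustration; the actual program loads the full Loughran-McDonald (or Henry) word lists and my expanded negate word list:

    import re

    # Stand-in word sets for illustration only; the real program loads the
    # full Loughran-McDonald (or Henry) word lists and a longer negate list.
    NEGATE = {"isn't", "not", "never", "no", "none"}
    POSITIVE = {"gain", "improve", "strong", "succeed"}
    NEGATIVE = {"loss", "decline", "adverse", "weak"}

    def tone_counts(words):
        """Count positive/negative words; a positive word is flipped to
        negative if a negate word occurs within the three preceding words."""
        pos = neg = 0
        for i, w in enumerate(words):
            if w in POSITIVE:
                # Flip if a negate word occurs within the three preceding words.
                if any(x in NEGATE for x in words[max(i - 3, 0):i]):
                    neg += 1
                else:
                    pos += 1
            elif w in NEGATIVE:
                # No negation check for negative words (double negation is rare).
                neg += 1
        return pos, neg

    words = re.findall(r"[a-z]+(?:'[a-z]+)?", "Sales did not improve.".lower())
    print(tone_counts(words))  # (0, 1): "not" flips "improve"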

Please note:

  • The Python program first transforms an article into a bag of words in their original order. Different research questions may define "word" differently. For example, some research questions look only at alphabetic words (i.e., they remove all numbers from an article). I use this definition in the following Python program, but you may want to change it to suit your research question. In addition, there are many nuances in splitting sentences into words; the splitting method in the following Python program is simple but, of course, imperfect. A minimal tokenizer sketch follows this list.
  • To use the Python program, you have to know how to assign the full text of an article to the variable article (using a loop) and how to output the results into a database-like file (SQLite or CSV).
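
For example, an "alphabetic words only" tokenizer can be as simple as keeping runs of letters and dropping everything else:

    import re

    # Keep only runs of letters; numbers and punctuation are dropped.
    def bag_of_words(article):
        return re.findall(r'[a-zA-Z]+', article)

    print(bag_of_words("Revenue grew 12% to $5.0 billion."))
    # ['Revenue', 'grew', 'to', 'billion']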

I acknowledge the work done by C.J. Hutto (see his work at GitHub).

[Original Post] I found two internet resources for this task (thanks to both authors):

The first solution is far more efficient than the second, but the second is more straightforward. The first requires extra knowledge of PostgreSQL and R besides Python. I borrow from the two resources and write the Python code below.

Please note: to use the Python code, you have to know how to assign the full text of the article of interest to the variable text, and how to output the total word count and the counts of positive/negative words in text.

In the first part of the code, I read the dictionary (the word list) into a Python dictionary variable. The word list used here is supposed to be a .txt file.
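
As a sketch, assume each line of the .txt file holds a category name, a colon, and that category's comma-separated words (a hypothetical layout; adapt the parsing to your actual file):

    Positive: able, advantage, attain
    Negative: abandon, adverse, against

Reading it into a dictionary could then look like this (the file name is hypothetical; the name dict shadows Python's built-in but matches the counting code later in this post):

    # Read each category's words into a Python dictionary variable.
    dict = {}
    with open('word_list.txt', 'r') as f:
        for line in f:
            if ':' not in line:
                continue  # skip blank or malformed lines
            cat, words = line.split(':', 1)
            dict[cat.strip()] = [w.strip() for w in words.split(',')]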

For accounting and finance research, a commonly used positive/negative word list was developed by Bill McDonald. See his website.

In the second part of the code, I create regular expressions that are used to find occurrences of positive/negative words. The last few lines of code get the counts of positive/negative words in the text.
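
A sketch of both parts, assuming dict was read as above and text holds the article's full text:

    import re

    # One compiled regex per category, matching any of its words as whole
    # words, case-insensitively.
    regex = {}
    for cat, words in dict.items():
        pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
        regex[cat] = re.compile(pattern, re.IGNORECASE)

    # Total word count and the per-category counts of matched words.
    wordcount = len(text.split())
    count = {cat: len(rx.findall(text)) for cat, rx in regex.items()}
    print(wordcount, count)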


9 Responses to Use Python to calculate the tone of financial articles

  1. Ian Gow says:

    I agree that my solution is more complex. But in part that's because it's a more complete solution. One has to download and process the data from Bill McDonald ("see his website for download" implies undocumented steps in the process). Then one has to organize and perhaps process the text so it can be fed to the Python function. Finally, one needs to handle the output.

    I think the first step on my site could be done in Python (rather than R … my decision to use R is more a reflection of my comparative advantage in R than anything inherent to Python). And the second step could be done without PostgreSQL (especially if the first step is done in Python). I think a “pure Python” approach would be more elegant than what I have, at least as a code illustration.

    • Kai Chen says:

      Hi Ian, happy to hear your thoughts so promptly. I like your blog and really benefit from it.

      I like how you deal with the regular expression pattern. It is very efficient, saving the trouble of using too many loops. In my experiment, your code is about 6 times faster than the other. I agree that your solution is more complete, and that reading texts from and outputting tone counts to a database is a better idea than reading/writing CSV. In my post, I do bypass the feeding and outputting parts.

  2. Mu Civ says:

    Hi Kai, I’m new to Python, so I really appreciate your code!

    Unfortunately, it doesn't work for me, though. A few errors occurred:

    #1 NameError: name 're' is not defined -> I added "import re", which helped, I guess.

    #2 NameError: name 'text' is not defined -> I defined text as text = "Bsp.text" (which is the document I would like to analyse). This also seemed to help; at least the error does not occur anymore.

    #3 NameError: name 'count' is not defined -> I really don't know how to fix this one, though... Can you help me, please?

    Thanks in advance!

    • Mu Civ says:

      Hi Kai,

      I’ve already solved my problem.

      Here is the last part of the code (if anyone should be interested):

      # Get tone count
      with open('Bsp.txt', 'r') as content_file:
          content = content_file.read()

      count = {}
      wordcount = len(content.split())
      for cat in dict.keys():
          count[cat] = len(regex[cat].findall(content))

      print(count)

      Thanks and have a nice day. 🙂

  3. Tom Jones says:

    Apart from the fact that your code doesn't actually work, it's great.

    • Kai Chen says:

      I never meant to provide click-and-run code. If the code does not work on your computer, you should do more debugging on your own. Many factors (Python version, operating system, ...) can cause it to break while running.

  4. Victor says:

    Good code!

    For my own code, I realize that I only tested for negating words immediately preceding the positive words, rather than within three words. I didn't read Loughran and McDonald (2011) carefully.

    I also realize that it would be even better if we first tokenized an article into sentences and did the negation test within the boundary of each sentence.
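
    A sketch of that idea, using nltk's sentence tokenizer (this assumes nltk and its 'punkt' data are installed, plus a word-level tone_counts(words) function like the one sketched in the update above):

        import re
        from nltk.tokenize import sent_tokenize  # needs nltk + 'punkt' data

        def tone_counts_by_sentence(article):
            pos = neg = 0
            # Run the within-three-words negation test inside each sentence
            # only, so a negate word cannot reach across a sentence boundary.
            for sentence in sent_tokenize(article):
                words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
                p, n = tone_counts(words)
                pos, neg = pos + p, neg + n
            return pos, neg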

    As for the definition of words, there is indeed no single definition. For example, Loughran and McDonald seem to define a word as [a-zA-Z]+. In their master dictionary, you can see "email" but not "e-mail"; "e-mail" becomes two words, "e" and "mail". By the same definition, "10-K" becomes "K". Sometimes people remove single-letter words. If you use nltk's word tokenizer, "couldn't" becomes "could" and "n't", "company's" becomes "company" and "'s", "e-mail" stays "e-mail", and "$5.0" becomes "$" and "5.0". People often apply further screening to remove punctuation, as well as tokens containing digits or punctuation. I find that after removing punctuation, the nltk tokens come very close to Microsoft Word's definition of words.
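
    The differences are easy to see in a quick comparison (word_tokenize needs nltk and its 'punkt' data):

        import re
        from nltk.tokenize import word_tokenize  # needs nltk + 'punkt' data

        # Letters-only definition, as Loughran and McDonald appear to use.
        print(re.findall(r'[a-zA-Z]+', "e-mail a 10-K"))
        # ['e', 'mail', 'a', 'K']

        # nltk keeps hyphenated words but splits clitics and symbols.
        print(word_tokenize("The company's e-mail couldn't cost $5.0"))
        # ['The', 'company', "'s", 'e-mail', 'could', "n't", 'cost', '$', '5.0']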

    Papers often do not make their own definitions clear, which makes replication difficult.
