Use Python to extract Intelligence Indexing fields in Factiva articles

First of all, I acknowledge that I benefited a lot from Neal Caren’s blog post Cleaning up LexisNexis Files. Thanks, Neal.

Factiva (like LexisNexis Academic) is a comprehensive repository of articles from newspapers, magazines, and other news sources. I first describe the data elements of a Factiva news article, then explain the steps to extract those data elements and write them into a more machine-readable table using Python.

Data Elements in a Factiva Article

Each news article in Factiva, no matter how it looks, contains a number of data elements. In Factiva’s terminology, these data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:

Field Label | Field Name | What It Contains
HD | Headline | Headline
CR | Credit Information | Credit information (example: Associated Press)
WC | Word Count | Number of words in document
PD | Publication Date | Publication date
ET | Publication Time | Publication time
SN | Source Name | Source name
SC | Source Code | Source code
ED | Edition | Edition of publication (example: Final)
PG | Page | Page on which article appeared (note: Page-One Story is a Dow Jones Intelligent Indexing™ term)
LA | Language | Language in which the document is written
CY | Copyright | Copyright
LP | Lead Paragraph | First two paragraphs of an article
TD | Text | Text following the lead paragraphs
CT | Contact | Contact name to obtain additional information
RF | Reference | Notes associated with a document
CO | Dow Jones Ticker Symbol | Dow Jones ticker symbol
IN | Industry Code | Dow Jones Intelligent Indexing™ industry code
NS | Subject Code | Dow Jones Intelligent Indexing™ subject code
RE | Region Code | Dow Jones Intelligent Indexing™ region code
IPC | Information Provider Code | Information provider code
IPD | Information Provider Descriptors | Information provider descriptors
PUB | Publisher Name | Publisher of information
AN | Accession Number | Unique Factiva.com identification number assigned to each document

Please note that not every news article contains all of these data elements, and that the table may not list every data element used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, you may not see certain data elements, but they are there, and Factiva uses them to organize and structure its proprietary news article data.

How to Extract Data Elements from a Factiva Article

[Figure: flow diagram of the three-step extraction process]

You can follow the three steps outlined in the diagram above to extract data elements from news articles for further processing (e.g., calculating the tone of the full text, which is the LP and TD elements combined, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one below.

Step 1: Download Articles from Factiva in RTF Format

It is a lot of pain to download a large number of news articles from Factiva: it is technically difficult to download articles in an automated fashion, you can only download 100 articles at a time, and those 100 articles cannot exceed the word-count limit of 180,000 words. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about both.

Firstly, you may see some people discuss methods for automatic downloading (a so-called “web scraping” technique; see here). However, this requires more hacking now that Factiva has introduced CAPTCHA to determine whether or not the user is a human. You may not be familiar with the term “CAPTCHA”, but you have surely encountered the situation in which you are asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to block robotic downloading. Though CAPTCHA is not unbeatable, beating it requires advanced techniques.

Secondly, the Factiva licence expressly prohibits data mining, but it does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be red-flagged as data mining. The threshold speed set by Factiva is low, and any trained, adept person can easily beat it. If you are red-flagged by Factiva, things could get ugly. So do not download too fast, even if this slows down your research.

Let’s get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option. Please select the third one, Full Article/Report plus Indexing, as shown in the following screenshot:

[Screenshot: Factiva display options, with “Full Article/Report plus Indexing” selected]

Then you have to download the articles in RTF – Article Format, as shown in the following screenshot:

[Screenshot: Factiva download menu, with “RTF – Article Format” selected]

After the download is completed, you will get an RTF document. If you open it, the news articles will look like this:

[Screenshot: a downloaded news article in the RTF document]

The next step is to convert RTF to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row of the table represents a news article, and each column is a data element.

Step 2: Convert RTF to TXT

Well, this can surely be done in Python, but so far I have not written a Python program to do it; I will fill this gap when I have time. For my research, I simply take advantage of TextEdit, the default text editor shipped with Mac OS: I select Format – Make Plain Text from the menu bar and then save the document in TXT format. You can make this happen in an automatic fashion using Automator in Mac OS.
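If you prefer the command line, Mac OS also ships with textutil, the same conversion engine behind TextEdit. The following is only a minimal sketch that batch-converts every RTF document in the current directory; the location of the files is an assumption, so adjust the glob pattern to your own setup.

    # Batch-convert Factiva RTF downloads to plain text using textutil,
    # the command-line conversion tool bundled with Mac OS.
    # Minimal sketch: assumes the RTF files sit in the current directory.
    import glob
    import subprocess

    for rtf_file in glob.glob('*.rtf'):
        # Writes, e.g., factiva-1.rtf -> factiva-1.txt next to the original.
        subprocess.run(['textutil', '-convert', 'txt', rtf_file], check=True)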

Step 3: Extract Data Elements and Save to a Table

This is where Python does the dirty work. To run the Python program correctly, save it in the directory containing all the plain TXT documents created in Step 2 before you run it. The program will:

  1. Read in each TXT document;
  2. Extract data elements of each article and write them to an SQLite database;
  3. Export data to a CSV file for easy processing in other software such as Stata.

I introduce an intermediate step that writes data to an SQLite database simply because this facilitates further manipulation of the news article data in Python for other purposes. Of course, you can write data directly to a CSV file. A simplified sketch of the core logic appears below.
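The following sketch illustrates the core of Step 3. It is a simplified illustration, not the full program: the layout assumption (each field label on its own, possibly space-indented line, with each article starting at HD) and the file names factiva.db and factiva.csv are mine, so adapt them to your own downloads.

    # Simplified sketch of Step 3: slice each article into its Intelligence
    # Indexing fields, store one row per article in SQLite, then export CSV.
    import csv
    import glob
    import re
    import sqlite3

    # The field labels listed in the table above.
    FIELDS = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA',
              'CY', 'LP', 'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC',
              'IPD', 'PUB', 'AN']

    # A field label on its own line; [ \t]* tolerates stray leading spaces
    # sometimes found in the TXT export.
    LABEL = re.compile(r'^[ \t]*(%s)[ \t]*$' % '|'.join(FIELDS), re.M)

    def parse_article(text):
        """Map each field label found in one article to its content."""
        matches = list(LABEL.finditer(text))
        record = {}
        for i, m in enumerate(matches):
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            record[m.group(1)] = text[m.end():end].strip()
        return record

    conn = sqlite3.connect('factiva.db')
    conn.execute('CREATE TABLE IF NOT EXISTS articles (%s)' %
                 ', '.join('"%s" TEXT' % f for f in FIELDS))

    for txt_file in glob.glob('*.txt'):
        with open(txt_file, encoding='utf-8') as f:
            data = f.read()
        # Assume each article starts at its HD (headline) label.
        starts = [m.start() for m in
                  re.finditer(r'^[ \t]*HD[ \t]*$', data, re.M)]
        for s, e in zip(starts, starts[1:] + [len(data)]):
            record = parse_article(data[s:e])
            conn.execute('INSERT INTO articles VALUES (%s)' %
                         ', '.join('?' * len(FIELDS)),
                         [record.get(f, '') for f in FIELDS])
    conn.commit()

    # Export the table to a CSV file for easy processing in, e.g., Stata.
    with open('factiva.csv', 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(FIELDS)
        writer.writerows(conn.execute('SELECT * FROM articles'))
    conn.close()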


15 Responses to Use Python to extract Intelligence Indexing fields in Factiva articles

  1. Nguyen says:

    Hi there,
    I am using your method to extract information from Factiva.
    However, there is some problem with the code and I cannot run it smoothly.
    I have the following error:
    UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 165: ordinal not in range(128)
    Could you please help me to solve it?

  2. Anna says:

    Hi Kai,

    I also have problems extracting information from Factiva with your code.
    I have the following error:

    return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 22: ordinal not in range(128)

    It seems there is also a problem with “parser(f)”.
    Could you please help me to solve it?
    Thank you!

  3. Joel Nothman says:

    With the HTML export rather than RTF, you can get a great representation of the data with this one-liner!

    import pandas as pd
    data = pd.concat([art for art in pd.read_html('/path/to/factiva-export.html', index_col=0) if 'HD' in art.index.values], axis=1).T.set_index('AN')

    You are welcome to then use data.to_sql() or data.to_csv()…

  4. Joris says:

    Hi Kai,

    I am trying to use your method to transform data from Factiva, but have run into an issue. Could you help me with this?

    When executing the code I get the following error:
    ————————————————
    File "factiva.py", line 72, in <module>
    parser(f)
    File "factiva.py", line 15, in parser
    start = re.search(r'\n HD\n', data).start()
    AttributeError: 'NoneType' object has no attribute 'start'
    ————————————————
    I am working with Python 3.6 on Windows 7. The script is in the same location as the txt file.

    Your help would be greatly appreciated!!

    • Kai Chen says:

      Hi Joris,

      This is probably because the program cannot find the text files in line 70. Two solutions: (1) Add the full path to the text files in line 70, e.g., for f in glob.glob('C:/Downloads/*.txt'). I don’t have a Windows machine, so please check the glob documentation if the syntax is incorrect; basically, you need to specify the full path. (2) A more future-proof solution is to use PyCharm, an advanced Python IDE that automatically searches for text files and other inputs in the folder containing the Python code.

      I hope this helps.

      • Joris says:

        Hi Kai,

        Thank you for your response, I really appreciate it.

        I have tried both solutions, but the error persists. It does now seem to find the first txt file but then gets stuck.
        In the source folder it does create a DB file but no CSV.
        With what OS/configuration does it work for you?

        It now reads:
        ———————————————
        C:\path\to\factiva-1.txt
        Traceback (most recent call last):
        File "C:\path\to\factiva.py", line 72, in <module>
        parser(f)
        File "C:\path\to\factiva.py", line 15, in parser
        start = re.search(r'\n HD\n', data).start()
        AttributeError: 'NoneType' object has no attribute 'start'
        ———————————————
        Would I maybe need to use Python 3.5 for it to work?

        Once more, thank you very much for your help!

  5. Xu says:

    Hi Kai,

    Thank you for sharing the code! I am having a little trouble running it and hope you can offer some help. Both the db file and the csv file always have the first article missing. In the second row of the csv file, the nid and id show correctly, but nothing appears for all the other elements. Here is an example: 1,3m,,,,,,,,,,,,,,,,,,,,,,,

    I am using Python 3.6.5 with Spyder.

    Thank you!

    • Xu says:

      Never mind. I found that for whatever reason, there are three spaces instead of one space in front of the first “HD”. Thank you anyway!

  6. Grace says:

    Dear Kai,

    Thank you very much for sharing your code! However, I ran into some problems when running it. The output csv file only had part of the indexing right; many sentences starting with connecting words (such as BUT, ALTHOUGH, …) and the article contents were unorganized and spread out everywhere in the spreadsheet. Does it have something to do with my original RTF file? (My RTF file looks quite different from yours, with tables, though I have followed your instructions exactly; mine can be opened using Microsoft Word, and there are pictures, etc. in it.) How could I fix the problem? Thank you very much!

    • Kai Chen says:

      Hi Grace, the Python program is very picky about the RTF layout, and you have to get it right. Alternatively, you can fine-tune the program to adapt it to your downloads.

      • Grace says:

        Dear Kai,

        Thank you so much for your reply! I will try to do it on a Mac and see if I can get the RTF layout right. I am also an accounting PhD student (about to graduate this year), and I have greatly benefited from the other materials on your website as well. Thank you for your selfless help.

  7. HA R says:

    Is there something I can do to parse the articles that I mistakenly downloaded without the indexing fields?
