Construct a structured DataFrame from the Reuters news corpus

Primary language: Python. License: MIT.

Reuters Financial Dataset as a structured DataFrame

The Reuters Financial Dataset is a large collection of financial news articles scraped from the Reuters website. Originally used for the paper Using Structured Events to Predict Stock Price Movement: An Empirical Investigation (Ding et al., 2014), this set of unstructured data is a rich store of historic financial data. This script arranges the huge corpus into a pandas DataFrame, an efficient tabular data structure.

Originally, this repository consisted of a badly written Python script that was monolithic and cryptic. This refactor breaks the code down into smaller functions and includes a function to create the DataFrame.

Usage

The build depends on the following libraries:

  • pandas
  • pyarrow or fastparquet (optional pandas dependencies, one of which is needed to read and write a DataFrame in parquet format)

To generate the parquet file yourself, please run the following commands:

git clone https://github.com/Kriyszig/financial-news-data.git
cd financial-news-data
git clone https://github.com/duynht/financial-news-dataset.git
python3 main.py

If you have cloned the dataset to a particular <path_to_dataset>, you can run the program as follows, pointing it to the location of the dataset:

python3 main.py <path_to_dataset> # Replace <path_to_dataset> with the absolute path to the ReutersNews106521 folder
# For example
python3 main.py /home/user/financial-news-dataset/ReutersNews106521
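
An optional path argument like this can be handled in several ways. As a minimal sketch (not the repository's actual main.py), a script could fall back to a default location when no path is given. The function name `resolve_dataset_path` and the default path are assumptions for illustration; the default is only inferred from the clone commands above.

```python
import argparse
from pathlib import Path

# Assumed default: the dataset cloned inside the repository folder.
DEFAULT_DATASET = Path("financial-news-dataset") / "ReutersNews106521"


def resolve_dataset_path(argv=None):
    """Return the dataset folder from the command line, or the assumed default."""
    parser = argparse.ArgumentParser(description="Build the Reuters DataFrame")
    parser.add_argument(
        "dataset",
        nargs="?",  # makes the positional argument optional
        type=Path,
        default=DEFAULT_DATASET,
        help="path to the ReutersNews106521 folder",
    )
    return parser.parse_args(argv).dataset
```

Calling `resolve_dataset_path()` with no arguments reads the path from the command line, while passing a list such as `["/home/user/financial-news-dataset/ReutersNews106521"]` makes the function easy to exercise without a shell.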

Please note that file generation may take up to 20 minutes. DataFrame generation now takes less than 10 seconds, and saving the DataFrame to a gzipped parquet file takes less than a minute after optimizing memory allocation.
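
To make the saving step concrete, here is a minimal sketch of writing a DataFrame to a gzipped parquet file with pandas. It is not the repository's own code: the helper names `build_frame` and `save_frame` and the sample values are hypothetical, and only the column names come from the Dataset section of this README.

```python
import pandas as pd

# Column names as documented in the Dataset section.
COLUMNS = ["Headline", "Journalists", "Date", "Link", "Article"]


def build_frame(records):
    """Build a DataFrame from dicts keyed by the documented column names."""
    return pd.DataFrame(records, columns=COLUMNS)


def save_frame(df, path):
    """Write the frame as gzip-compressed parquet (needs pyarrow or fastparquet)."""
    df.to_parquet(path, compression="gzip")


# Placeholder rows for illustration only; these are not real articles.
sample = [
    {"Headline": "Example headline", "Journalists": ["A. Writer"],
     "Date": "<date>", "Link": "https://example.com/a", "Article": "Body text"},
    {"Headline": "Headline only", "Journalists": [],
     "Date": "<date>", "Link": "https://example.com/b", "Article": ""},
]
frame = build_frame(sample)
# save_frame(frame, "sample.parquet.gzip")  # uncomment to write the file
```

The `compression="gzip"` argument trades some write speed for a smaller file, which matters for a corpus of this size.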

The file financial_data.parquet.gzip contains the dataset. To create a DataFrame from this file, use the code snippet below:

import pandas as pd
df = pd.read_parquet('financial_data.parquet.gzip')

You are now all set to start manipulating df to suit your needs.

Dataset

The Dataset has the following columns:

Column       Type
Headline     string
Journalists  list<string> (can be empty)
Date         Unix-style date
Link         Original Reuters article link
Article      The complete report (can be empty)

Note:

  • Journalists can be an empty list if the original dataset had the field empty
  • Article can be an empty string if only the headline was reported in the original dataset
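
Because both fields can be empty, it is often useful to filter them before analysis. The sketch below shows one way to do that, assuming the column names from the table above; the helper names `with_full_text` and `with_byline` are hypothetical and not part of this repository.

```python
import pandas as pd


def with_full_text(df):
    """Keep rows whose Article is a non-empty string."""
    return df[df["Article"].str.len() > 0]


def with_byline(df):
    """Keep rows that list at least one journalist."""
    return df[df["Journalists"].map(len) > 0]
```

For example, `with_byline(with_full_text(df))` keeps only the rows that have both an article body and at least one named journalist.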

If you run into any trouble, please feel free to open an issue and I'll look into it as soon as possible.

Copyright

It has come to my notice that, due to copyright issues with the news articles, the original repository by Philippe Rémy has taken down the dataset. That said, the repository was forked 46 times, and some of these forks still contain the Reuters dataset. To avoid copyright infringement, the parquet branch containing the dataset as a gzipped parquet file has been removed. Thanks to the massive improvement in build time, it is feasible for anyone to generate the dataset themselves, even on a less powerful machine.