
Machine Learning course term project, in collaboration with Dr. Daniel Chen and Dr. Elliot Ash

Predicting exchange rate changes with diplomatic cables

  • This project's objective is to measure how confidential information affects financial markets. Market efficiency rests on the assumption that all participants have fair access to the market and, more importantly, to every kind of information. In the real world, however, some people do possess confidential information. Most financial markets prohibit insider trading of any kind. Because confidential data is difficult to access, detecting its influence on the market is hard.

  • Luckily, we now have the Public Library of US Diplomacy dataset. We scraped documents from 2000 to 2010. With the scraped data, we built models to test whether we can predict abnormal changes in the exchange rates of different countries. The historical exchange rates are fetched from CRSP datasets and cover 21 countries: Australia, Brazil, Canada, China, Denmark, Hong Kong, India, Japan, Korea, Malaysia, Mexico, New Zealand, Norway, Sweden, South Africa, Singapore, Sri Lanka, Taiwan, Thailand, United Kingdom and Venezuela.

Data set

We have two datasets: the Public Library of US Diplomacy and the exchange rate dataset fetched from CRSP. For the purpose of error analysis, we split the data by country and by year. You can access the data through the links below.

  • country by country
    • date: the date of wikileaks cable
    • content: the content of wikileaks cable
    • exchange rate: the 15 days log return of exchange rate before time t
    • numerical label: log return of exchange rate at time t
    • dummy label: abnormal log return or not
We also include a joint table of all countries' information, named final_All_countries.

  • year by year
    • date: the date of wikileaks cable
    • content: the content of wikileaks cable
    • exchange rate: the 15 days log return of exchange rate before time t
    • numerical label: log return of exchange rate at time t
    • dummy label: abnormal log return or not
And the following is a joined table, including all countries and all years.

  • all by all
    • date: the date of wikileaks cable
    • content: the content of wikileaks cable
    • exchange rate: the 15 days log return of exchange rate before time t
    • numerical label: log return of exchange rate at time t
    • dummy label: abnormal log return or not
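
The field descriptions above can be illustrated with a minimal sketch. It assumes that "the 15 days log return" means the 15 daily log returns preceding time t, and that a log return is ln(p_t / p_{t-1}); both are interpretations, not something this README confirms. The function names and the input series are hypothetical and synthetic. The dummy label is left out because the rule for an "abnormal" return is not stated here.

```python
import math

WINDOW = 15  # assumed: number of daily log returns preceding time t


def log_returns(rates):
    """Daily log returns r_t = ln(p_t / p_{t-1}) for a list of positive rates."""
    return [math.log(curr / prev) for prev, curr in zip(rates, rates[1:])]


def build_rows(rates, window=WINDOW):
    """Pair each return r_t with the `window` returns that precede it."""
    returns = log_returns(rates)
    rows = []
    for t in range(window, len(returns)):
        rows.append({
            "exchange_rate": returns[t - window:t],  # features: returns before time t
            "numerical_label": returns[t],           # target: return at time t
        })
    return rows


# Synthetic series for illustration only (made-up numbers, not CRSP data).
rates = [10.0 * (1.001 ** i) for i in range(20)]
rows = build_rows(rates)
print(len(rows), len(rows[0]["exchange_rate"]))
```
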
Build the regression and classification model

To build the Random Forest Regression Model:

  1. Change working directory to the directory of RanFrst_regres.py.
  2. Run the Python script following the instructions below.

If you are interested in the classifier functions in a Jupyter notebook, we have the following two examples:

  • All_Countries_Neg_AUC_May17th.ipynb : does the classification and the AUC plot
  • Feature_Importance_All_Columns.ipynb : finds the feature importances

The file names that the scripts search for are hard-coded. So, if you want to rename the data files, you will need to modify the main function.

  • For example, for the Country by Country and Year by Year folders:

Run the random forest classifier or regression:

python RanFrst_classfy.py 10 ./data_country/ country_10 -country
python RanFrst_regres.py 10 ./data_year/ country_10 -year

Here, the parameters are:

  • 10: the number of estimators in the random forest model
  • ./data_country/: input data path; the data is assumed to be stored under ./data_country/
  • country_10: output data path; the script automatically creates a directory called ./output_country_10/
  • -country: tells the model which kind of data to search for (country level or year level)
  • -year: tells the model which kind of data to search for (country level or year level)
  • Year by Year

Run the random forest classifier or regression:

python RanFrst_regres.py 10 ./data_country/ year_10 -all
python RanFrst_classfy.py 10 ./data_year/ year_10 -all
  • 10: the number of estimators in the random forest model
  • ./data_year/: input data path; the data is assumed to be stored under ./data_year/
  • year_10: output data path; the script automatically creates a directory called ./output_year_10/
  • -all: run all feature functions: text, exchange rate and mixed features. -text, -price and -mix run only the corresponding function, which can reduce computation time.
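
As a rough illustration of the four positional arguments described above, the following sketch parses them into a dictionary. It is not the code of RanFrst_regres.py or RanFrst_classfy.py: the function name parse_args, the dictionary keys, and the validation are hypothetical, and only the argument order, the flag names, and the ./output_<name>/ convention come from this README.

```python
# Flags named in this README; how the real scripts treat them is assumed.
DATA_FLAGS = {"-country", "-year"}
FEATURE_FLAGS = {"-all", "-text", "-price", "-mix"}


def parse_args(argv):
    """Split [n_estimators, input_dir, output_name, flag] into a dict."""
    if len(argv) != 4:
        raise ValueError("expected: N_ESTIMATORS INPUT_DIR OUTPUT_NAME FLAG")
    n_estimators, input_dir, output_name, flag = argv
    if flag not in DATA_FLAGS | FEATURE_FLAGS:
        raise ValueError(f"unknown flag: {flag}")
    return {
        "n_estimators": int(n_estimators),
        "input_dir": input_dir,
        # The README says the output directory is ./output_<name>/
        "output_dir": f"./output_{output_name}/",
        "flag": flag,
    }


# Mirrors: python RanFrst_classfy.py 10 ./data_country/ country_10 -country
args = parse_args(["10", "./data_country/", "country_10", "-country"])
print(args["output_dir"])
```
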

Result - regression

RanFrst_regres.py automatically generates output under the user-defined output directory, e.g. ./output_country_10/ or ./output_year_10/.

  • Country by Country: it generates 22 files named after the countries, e.g. the input file final_Single_mexico generates the file mexico with
  • mse: mean squared error
  • mae: mean absolute error
  • median_absolute_error: median absolute error
  • r2: R-squared score
  • (2086, 5): 2086 instances in total with 5 features (the size of the input dataframe)
Country mexico
mse_text is 1.979e-05
mae_text is 1.979e-05
median_absolute_error stripes 0.00263002
r2_text -1.20974962
------------- -------------
mse_price is 8.6e-07
mae_price is 0.0004434
mdn_ae_price stripes 0.00011176
r2_price 0.9034855
------------- -------------
mse_mix 6.86e-06
mae_mix 0.00144322
mdn_ae_mix 0.00040706
r2_mix 0.23445956
------------- -------------
(2086, 5)
  • Year by Year: it generates 10 files named after the years, e.g. the input files for 2003 generate the file 2003 with
  • mse: mean squared error
  • mae: mean absolute error
  • median_absolute_error: median absolute error
  • r2: R-squared score
  • (1233, 5): 1233 instances in total with 5 features (the size of the input dataframe)
Year 2003
mse_text is 6.79e-06
mae_text is 6.79e-06
median_absolute_error stripes 0.00162758
r2_text -0.03564915
------------- -------------
mse_price is 4.17e-06
mae_price is 0.00126901
mdn_ae_price stripes 0.00069519
r2_price 0.36338027
------------- -------------
mse_mix 5.36e-06
mae_mix 0.00179364
mdn_ae_mix 0.00137725
r2_mix 0.18312545
------------- -------------
(1233, 5)
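
For readers unfamiliar with the four metrics reported above, here is a minimal pure-Python sketch of their standard definitions. It is written without scikit-learn only to make the formulas explicit; the function name and the sample values are hypothetical, and this is not necessarily how RanFrst_regres.py computes them. Note that r2 can be negative, as in the r2_text lines above, when the predictions fit worse than always predicting the mean of the targets.

```python
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Return mse, mae, median absolute error and R-squared for two lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "mse": ss_res / len(errors),
        "mae": mean(abs_errors),
        "median_absolute_error": median(abs_errors),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up values for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```
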

Output - classify

RanFrst_classfy.py automatically generates output under the user-defined output directory, e.g. ./output_country_10/ or ./output_year_10/. It also automatically creates a folder and stores the three generated AUC figures in it.

  • Country by Country: it generates 22 files named after the countries, e.g. the input file final_Single_australia generates the file australia with the following table,
  • fpr: false positive rates in increasing order; this example has only one threshold
  • tpr: true positive rates in increasing order; this example has only one threshold
  • roc_auc: area under the receiver operating characteristic curve (ROC AUC)
Country australia
fpr is: [ 0. 0.06140351 1. ]
tpr is: [ 0. 0. 1.]
roc_auc is 0.469298245614
------------- -------------
fpr is: [ 0. 0.00877193 1. ]
tpr is: [ 0. 0.85714286 1. ]
roc_auc is 0.924185463659
------------- -------------
fpr is: [ 0. 0.06140351 1. ]
tpr is: [ 0. 0.23809524 1. ]
roc_auc is 0.588345864662
------------- -------------
(1350, 5)

and three figures showing roc_auc under ./output_country_10/mexico_figure/
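
With a single threshold, the ROC curve has only three points, so the area under it follows from the trapezoidal rule. The sketch below applies that rule to the first fpr/tpr pair printed for australia and reproduces the reported roc_auc up to the rounding of the printed values. The function name is hypothetical and this is an illustration of the calculation, not the script's own code.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear ROC curve given matching fpr/tpr points."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        avg_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * avg_height
    return area


# First block of the australia output above.
fpr = [0.0, 0.06140351, 1.0]
tpr = [0.0, 0.0, 1.0]
auc = trapezoid_auc(fpr, tpr)
print(round(auc, 6))
```

For a three-point curve this reduces to (1 + tpr - fpr) / 2 at the single threshold, which is why a classifier with tpr = 0 scores below 0.5.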


Trained Model and Results with 30 trees after feature selection

  • Data sliced by year as training and test sets, using the top 50 most important features in random forest regression to predict negative returns of the exchange rate over time.

all_year_neg_30 after feature selection

The folder contains:

  • model output for each year, e.g. 2000
  • the folder model contains:
    • pickle files with file name suffix 0 -- only text, 1 -- only exchange rate, 2 -- mixed features
  • Data sliced by country as training and test sets, using the top 50 most important features in random forest regression to predict negative returns of the exchange rate over time.

all_country_neg_30 after feature selection

The folder contains:

  • model output for each country, e.g. sweden
  • the folder model contains:
    • pickle files with file name suffix 0 -- only text, 1 -- only exchange rate, 2 -- mixed features
  • Full dataset as training and test sets, using the top 50 most important features in random forest regression to predict negative returns of the exchange rate over time.

Full dataset neg 30 after feature selection

The folder contains:

  • model output for the full dataset, e.g. countries
  • the folder model contains:
    • pickle files with file name suffix 0 -- only text, 1 -- only exchange rate, 2 -- mixed features
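
A stored model can be read back with Python's pickle module. The sketch below is a minimal example under stated assumptions: it assumes the suffix 0, 1 or 2 is the last character of the file name before the extension, and the file name used in the example is hypothetical, since the exact names are not listed here. Only unpickle files from a source you trust, and note that loading a pickled scikit-learn model generally requires a compatible scikit-learn version to be installed.

```python
import pickle
from pathlib import Path

# Suffix meanings as listed in this README.
FEATURE_SETS = {"0": "only text", "1": "only exchange rate", "2": "mixed features"}


def feature_set_for(path):
    """Map a pickle file's name suffix (assumed: last character of the stem)."""
    return FEATURE_SETS[Path(path).stem[-1]]


def load_model(path):
    """Load one pickled object from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical file name, used only to show the suffix lookup.
print(feature_set_for("model/sweden_2.pkl"))
```
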