- Project Description
- Installation
- Data Understanding
- Code Description
- Licensing, Authors, Acknowledgements
A Medium blog post is available here: https://yuki678.medium.com/fedspeak-how-to-build-a-nlp-pipeline-to-predict-central-bank-policy-changes-a2f157ca0434?sk=989433349aed4e6dd1faf5a72e848e35
Please refer to the post for business understanding, project overview and analysis result.
Required libraries are listed in requirements.txt. The code should run without issues on Python 3.6+. Create a virtual environment of your choice; the example below uses Anaconda:
conda create -n fomc python=3.6 jupyter
conda activate fomc
pip install -r requirements.txt
- Create data directory
cd data
mkdir FOMC MarketData LoughranMcDonald GloVe preprocessed train_data result
cd FOMC
mkdir statement minutes presconf_script meeting_script script_pdf speech testimony chair
cd ../MarketData
mkdir Quandl
- Move to src directory
cd ../../src
- Get data from the FOMC Website. Specify the document type; you can also specify the starting year.
python FomcGetData.py all 1980
- Get calendar from FOMC Website. Specify from year.
python FomcGetCalendar.py 1980
- Get data from Quandl. Specify your API Key and the From Date (yyyy-mm-dd). You can also specify a Quandl Code; otherwise all required data is downloaded.
python QuandlGetData.py [your API Key] 1980-01-01
- Download the Sentiment Dictionary in csv format into the data/LoughranMcDonald directory
- Loughran and McDonald Sentiment Word Lists (https://sraf.nd.edu/textual-analysis/resources/)
- Go to top directory
cd ../
- Run the jupyter notebooks
jupyter notebook
- Open and run notebooks No.1 to No.8 for analysis
All notebooks can be executed on Google Colab.
- Upload notebooks to your Google Drive
- Upload downloaded data to your Google Drive (Colab Data dir)
- Execute each notebook (Note: You need to authorize the access to your Google Drive when asked to input the code)
Text data is scraped from the FOMC Website. Other economic and market data are downloaded from the FRB of St. Louis website (FRED). Data used for each prediction are only those available before the meeting.
- FOMC/fomc_calendar.pickle - all FOMC calendar dates
- FOMC/statement.pickle - Statement text along with basic attributes such as dates, speaker, title. Each text is also available in the directory with the same name. Statements are available post press conference for almost all meetings and include the rate decision and target rate. From 2008, the target rate became a range instead of a single value.
- FOMC/minutes.pickle - Minutes text along with basic attributes such as dates, speaker, title. Each text is also available in the directory with the same name. Minutes are a summary of the FOMC Meeting, and the contents are structured in sections and paragraphs, most of which were updated in 2011 and 2012. The minutes of regularly scheduled meetings are released three weeks after the date of the policy decision.
- FOMC/presconf_script.pickle - Press conference script text along with basic attributes such as dates, speaker, title. Each text is also available in the directory with the same name. This is available from 2011. Each passage starts with the speaker name, so only the passages spoken by the chairperson are extracted, because other speakers' words are more likely to be questions than the FOMC's view. The scripts are published in pdf form, so the pdf is downloaded first and the text is then processed.
- FOMC/meeting_script.pickle - Meeting script text along with basic attributes such as dates, speaker, title. Each text is also available in the directory with the same name. The FOMC publishes this five years after each meeting. It contains all the words spoken during the meeting. It may give some insight into FOMC discussions and how the consensus on monetary policy is built, but it cannot be used for prediction because it is not published for five years. The scripts are published in pdf form, so the pdf is downloaded first and the text is then processed.
- FOMC/speech.pickle - Speech text along with basic attributes such as dates, speaker, title. Each text is also available in the directory with the same name. Many speeches are published, but some of them are not related to monetary policy and instead cover various topics such as regulation and governance. Some speeches may contain indications of FOMC policy, so only those by the chairperson are used.
- FOMC/testimony.pickle - Testimony text along with basic attributes such as dates, speaker, title. Each text is also available in the directory with the same name. Like speeches, testimony is not necessarily related to monetary policy. The semi-annual testimony before Congress can be a good input on the FOMC's view from the chairperson, so only testimony by the chairperson is used.
In MarketData/Quandl, each csv file is saved with its Quandl Code as the file name.
- FED Rate
- FRED_DFEDTAR.csv - Target FED Rate till 2008, Daily
- FRED_DFEDTARU.csv - Target Upper FED Rate from 2008, Daily
- FRED_DFEDTARL.csv - Target Lower FED Rate from 2008, Daily
- FRED_DFF.csv - Effective FED Rate, Daily
- GDP
- FRED_GDPC1.csv - Real GDP, Quarterly
- FRED_GDPPOT.csv - Real potential GDP, Quarterly
- CPI
- FRED_PCEPILFE.csv - Core PCE excluding Food and Energy, Monthly
- FRED_CPIAUCSL.csv - Consumer Price Index for All Urban Consumers: All Items in U.S. City Average
- Employment
- FRED_UNRATE.csv - Unemployment Rate, Monthly
- FRED_PAYEMS.csv - Employment, Monthly
- Sales
- FRED_RRSFS.csv - Advance Real Retail and Food Services Sales, monthly
- FRED_HSN1F.csv - New Home Sales, monthly
- ISM
- ISM_MAN_PMI.csv - ISM Purchasing Managers Index
- ISM_NONMAN_NMI.csv - ISM Non-manufacturing Index
- Treasury
- USTREASURY_YIELD.csv - This is optional as not used in the final analysis.
- LoughranMcDonald/LoughranMcDonald_SentimentWordLists_2018.csv - This is used in preliminary analysis and creating Tfidf vectors.
First, take a glance at the FOMC statement to see if it contains any meaningful information.
- ../data/FOMC/statement.pickle
- ../data/MarketData/Quandl/FRED_DFEDTAR.csv
- ../data/MarketData/Quandl/FRED_DFEDTARU.csv
- ../data/MarketData/Quandl/FRED_DFEDTARL.csv
- None
- Analyze sentiment of the statement text using Loughran and McDonald Sentiment Word Lists
- Plot sentiment (count of positive words with negation, negative words and net) over time series, normalized by the number of words
- Load FED Rate, map the rate and decision to statement
- Plot the moving average of the sentiment along with FED rate and recession period
- Plot the same with Quantitative Easing and Chairpersons
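The word counting behind these steps can be sketched as follows. This is a minimal illustration, not the notebook's code: the three-word lists, the negation set, the three-word look-back window and the function name `sentiment_counts` are all assumptions made for the example, and treating a negated positive word as negative is only one possible reading of "positive words with negation". The real analysis uses the full Loughran and McDonald lists.

```python
import re

# Tiny illustrative word lists (assumption); the analysis itself uses the
# Loughran and McDonald Sentiment Word Lists.
POSITIVE = {"strong", "improve", "gain"}
NEGATIVE = {"weak", "decline", "loss"}
NEGATIONS = {"not", "no", "never"}


def sentiment_counts(text, window=3):
    """Count positive and negative words in `text`.

    A positive word is counted as negative when a negation word appears
    within the `window` words before it (hypothetical rule for this sketch).
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = neg = 0
    for i, word in enumerate(words):
        if word in NEGATIVE:
            neg += 1
        elif word in POSITIVE:
            preceding = words[max(0, i - window):i]
            if any(w in NEGATIONS for w in preceding):
                neg += 1
            else:
                pos += 1
    total = len(words)
    # Net sentiment normalized by the number of words in the text.
    net = (pos - neg) / total if total else 0.0
    return {"positive": pos, "negative": neg, "word_count": total, "net": net}


result = sentiment_counts("Growth is strong but employment is not strong")
print(result)
```

Normalizing by word count keeps long and short statements comparable when the series is plotted over time.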
Next, preprocess the nontext meta data. Perform the necessary calculations and add the results to the calendar dataframe, so that the latest available indices are mapped as input to each FOMC Fed rate decision.
- ../data/FOMC/fomc_calendar.pickle
- All Market Data and Economic Indices
- ../data/preprocessed/nontext_data
- ../data/preprocessed/nontext_ma2
- ../data/preprocessed/nontext_ma3
- ../data/preprocessed/nontext_ma6
- ../data/preprocessed/nontext_ma12
- ../data/preprocessed/treasury
- ../data/preprocessed/fomc_calendar
- Load and plot all numerical data
- Add FED Rate and rate decisions to FOMC Meeting Calendar
- Add QE as Lowering event and Tapering as Raising event
- Add the economic indices to the FOMC Meeting Calendar
- Calculate Taylor rule
- Calculate moving average
- Save data
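The Taylor rule and moving average steps can be sketched as below. This is an illustration under stated assumptions: it uses the classic Taylor (1993) form with coefficients of 0.5, a 2% neutral real rate and a 2% inflation target, which may differ from the variants computed in the notebook. The function names and parameters are hypothetical.

```python
def taylor_rule(inflation, real_gdp, potential_gdp,
                neutral_rate=2.0, target_inflation=2.0):
    """Classic Taylor (1993) rule; all rates are in percent.

    rate = neutral_rate + inflation
           + 0.5 * (inflation - target_inflation)
           + 0.5 * output_gap
    """
    # Output gap as a percentage of potential GDP.
    output_gap = 100.0 * (real_gdp - potential_gdp) / potential_gdp
    return (neutral_rate + inflation
            + 0.5 * (inflation - target_inflation)
            + 0.5 * output_gap)


def moving_average(values, window):
    """Trailing moving average; positions without a full window are None."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


# Inflation of 3% with real GDP 2% above potential.
print(taylor_rule(3.0, 102.0, 100.0))
print(moving_average([1, 2, 3, 4], window=2))
```

The inputs correspond to series such as FRED_GDPC1 and FRED_GDPPOT for the output gap and an inflation measure such as FRED_PCEPILFE; which series feed which term is an assumption here.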
- ../data/preprocessed/fomc_calendar.pickle
- ../data/FOMC/statement.pickle
- ../data/FOMC/minutes.pickle
- ../data/FOMC/meeting_script.pickle
- ../data/FOMC/presconf_script.pickle
- ../data/FOMC/speech.pickle
- ../data/FOMC/testimony.pickle
- ../data/preprocessed/text_no_split
- ../data/preprocessed/text_split_200
- ../data/preprocessed/text_keyword
- Add QE announcement to statement
- Add Rate and Decision to Statement, Minutes, Meeting Script and Presconf Script
- Add Word Count, Next Meeting Date, Next Meeting Rate and Next Meeting Decision to all inputs
- Remove return code and separate text by sections
- Remove short sections - those with fewer words than a threshold, as they are unlikely to hold good information
- Split text of Step 5 into chunks of at most 200 words with a 50-word overlap
- Filter text of Step 5 to keep only those containing a keyword at least 2 times
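The splitting and keyword filtering steps can be sketched as follows. This is a minimal illustration assuming whitespace tokenization and substring matching for keywords; the function names and the example keyword list are hypothetical and not taken from the notebook.

```python
def split_with_overlap(text, max_words=200, overlap=50):
    """Split `text` into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words.
    """
    words = text.split()
    if not words:
        return []
    if len(words) <= max_words:
        return [" ".join(words)]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def has_keyword(text, keywords, min_count=2):
    """Return True if the keywords occur at least `min_count` times in total."""
    lowered = text.lower()
    return sum(lowered.count(k.lower()) for k in keywords) >= min_count


sample = " ".join(str(i) for i in range(450))
chunks = split_with_overlap(sample)
print([len(c.split()) for c in chunks])
print(has_keyword("The rate rose; the rate fell.", ["rate"]))
```

The overlap preserves context that would otherwise be cut at a chunk boundary, at the cost of some duplicated words between neighbouring chunks.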
- ../data/preprocessed/nontext_data.pickle
- ../data/preprocessed/nontext_ma2.pickle
- ../data/preprocessed/nontext_ma3.pickle
- ../data/preprocessed/nontext_ma6.pickle
- ../data/preprocessed/nontext_ma12.pickle
- ../data/train_data/nontext_train_small
- ../data/train_data/nontext_train_large
- Check correlation to find good feature to predict Rate Decision
- Check correlation of moving average to Rate Decision
- Check correlation of calculated rates and changes by the Taylor rule
- Compare the distribution of each feature across Rate Decision classes
- Fill missing values
- Create a small dataset with 9 selected features and a large dataset, which contains all features
- ../data/train_data/nontext_train_small.pickle or
- ../data/train_data/nontext_train_large.pickle
- ../data/result/result_scores
- ../data/result/baseline_predictions
- ../data/result/training_data
- Balancing the classes
- Convert the target to integer starting from 0
- Train test split
- Apply 14 different classifiers to see how they perform
- Build and run random search and grid search cross validation models for the following classifiers
- ADA Boost on Decision Tree
- Extra Tree
- Random Forest
- Gradient Boosting
- Support Vector Machine
- Check Feature Importance
- Build and run Ensemble models
- Voting Classifier
- Stacking by XG Boost
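The first two steps of this process, balancing the classes and converting the target to integers starting from 0, can be sketched as below. The source does not say which balancing method is used, so random oversampling of the minority classes is an assumption made for this example. The function names and the sample labels (-1, 0, 1 for lower, hold, raise) are also hypothetical.

```python
import random
from collections import defaultdict


def encode_targets(labels):
    """Map class labels to integers starting from 0, in sorted label order."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


def oversample(rows, labels, seed=0):
    """Balance classes by randomly repeating rows of the smaller classes
    (assumed method; the notebook may balance differently)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    largest = max(len(items) for items in by_class.values())
    out_rows, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(largest - len(items))]
        for row in items + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


encoded, mapping = encode_targets([-1, 0, 1, 0])
print(encoded, mapping)
rows, labels = oversample(["a", "b", "c", "d"], [0, 0, 0, 1])
print(labels.count(0), labels.count(1))
```

Oversampling should be applied to the training portion only, after the train test split, so that repeated rows do not leak into the test set.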
- ../data/train_data/nontext_train_small.pickle
- ../data/preprocessed/text_no_split.pickle
- ../data/preprocessed/text_split_200.pickle
- ../data/preprocessed/text_keyword.pickle
- ../data/LoughranMcDonald/LoughranMcDonald_SentimentWordLists_2018.csv
- Check the record count, drop meeting scripts
- Select which text to use and merge the text to nontext train dataframe
- View text by creating corpus to see word frequencies
- Load LoughranMcDonald Sentiment word list and analyze the sentiment of each text
- Lemmatize, remove stop words, tokenize the texts as well as the sentiment words
- Vectorize the text by Tfidf
- Calculate Cosine Similarity and add difference from the previous text
- Convert the target to integer starting from 0, use Stratified KFold
- Model A - Use Cosine Similarity for Random Forest
- Model B - Use Tfidf vector and merge with meta data to perform Random Forest
- Model C - Use LSTM (RNN) based text analysis, then merge with meta data at the last dense layer
- Model D - Use GloVe Word Embedding for Model C
- Further split the training data into chunks of at most 200 words with a 50-word overlap and run Model D again
- Model E - Use BERT, then merge with meta data at the last dense layer
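The Tfidf and cosine similarity steps can be sketched in plain Python as below. This is a simplified illustration assuming whitespace tokenization and a smoothed inverse document frequency. Comparing each text with the one before it is one interpretation of "difference from the previous text". The function names are hypothetical; libraries such as scikit-learn provide a `TfidfVectorizer` that does this more completely.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse Tfidf vector (term -> weight dict) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_to_previous(documents):
    """Similarity of each document to the one before it (None for the first)."""
    vectors = tfidf_vectors(documents)
    return [None] + [cosine_similarity(vectors[i - 1], vectors[i])
                     for i in range(1, len(vectors))]


texts = ["rates unchanged", "rates unchanged", "inflation rising"]
print(similarity_to_previous(texts))
```

A low similarity to the previous text signals that the wording changed between meetings, which is the kind of feature Model A passes to the Random Forest.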
- ../data/preprocessed/text_no_split.pickle
- ../data/preprocessed/text_keyword.pickle
- ../data/models/finphrase_bert_trained.dict
- ../train_data/train_df.pickle
- ../train_data/sentiment_bert_result
- ../train_data/sentiment_bert_all
- ../train_data/sentiment_bert_stmt
- ../train_data/sentiment_bert_minutes
- ../train_data/sentiment_bert_presconf
- ../train_data/sentiment_bert_m_script
- ../train_data/sentiment_bert_speech
- ../train_data/sentiment_bert_testimony
- Check the record count, combine meeting scripts by speaker
- Split each text by sentence
- Load a trained BERT model and run prediction
- Count the number of sentences per predicted sentiment for each FOMC Meeting
- Visualize the result
- Combine the result with Non-text data
- Perform the same machine learning as the baseline model
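The sentence splitting and per-meeting counting steps can be sketched as follows. The BERT prediction itself is left out; the example assumes each sentence has already been assigned a label. The regex-based sentence splitter, the label names and the function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict


def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def count_sentiment_by_meeting(predictions):
    """Count predicted labels per meeting.

    `predictions` is an iterable of (meeting_date, label) pairs, one pair
    per sentence.
    """
    counts = defaultdict(Counter)
    for meeting_date, label in predictions:
        counts[meeting_date][label] += 1
    return {date: dict(counter) for date, counter in counts.items()}


print(split_sentences("Inflation rose. Growth slowed!"))
predicted = [
    ("2019-01-30", "positive"),
    ("2019-01-30", "negative"),
    ("2019-01-30", "positive"),
    ("2019-03-20", "neutral"),
]
print(count_sentiment_by_meeting(predicted))
```

The resulting counts per meeting can then be joined to the non-text data on the meeting date.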
- ../data/preprocessed/fomc_calendar.pickle
- ../data/preprocessed/nontext_data.pickle
- ../data/preprocessed/text_no_split.pickle
- ../data/train_data/train_df.pickle
- ../data/FOMC/statement.pickle
- Visualize FED Rate
- Visualize Economic Indices
- Visualize FOMC Text
- Visualize Sentiment
- Visualize Correlation, Taylor Rule
- Visualize the final result
- FomcGetCalendar.py - Creates fomc_calendar from the FOMC Website and saves it in pickle and csv
- FomcGetData.py - Calls relevant classes to get data from FOMC Website
- QuandlGetData.py - Get market data from Quandl.
- fomc_get_data/FomcBase.py - Base abstract class to scrape FOMC Website to download text data
- fomc_get_data/FomcStatement.py - Child class of FomcBase to retrieve statement texts
- fomc_get_data/FomcMinutes.py - Child class of FomcBase to retrieve minutes texts
- fomc_get_data/FomcPresConfScript.py - Child class of FomcBase to retrieve press conference script texts
- fomc_get_data/FomcMeetingScript.py - Child class of FomcBase to retrieve meeting script texts
- fomc_get_data/FomcSpeech.py - Child class of FomcBase to retrieve speech texts
- fomc_get_data/FomcTestimony.py - Child class of FomcBase to retrieve testimony texts
The following notebooks are used only for initial checks and are not required to run:
- FOMC_analyse_website.ipynb
- FOMC_analyse_website_2.ipynb
- FOMC_check_FEDRate.ipynb
- FOMC_Analysis_BERT_MultiSampleDropoutModel.ipynb
- FOMC_Analysis_BERT_Tensorflow.ipynb
- FOMC_Post_Training_BERT.ipynb
- FOMC_Text_Summarization.ipynb
Data is attributed to its sources (FRED, ISM, US Treasury and Quandl). The Loughran McDonald dictionary is attributed to https://sraf.nd.edu/textual-analysis/resources/ at the University of Notre Dame. Feel free to use the source code as you would like!