Copyright (c) 2022 Maximilian W. Hofer & Kenneth A. Younge.
Repo maintainer: Maximilian W. Hofer (maximilian.hofer@epfl.ch)
AUTHOR: Maximilian W. Hofer
SOURCE: https://github.com/mxhofer/OrgSim-RL
LICENSE: Access to this code is provided under an MIT License.
The RiskyData-LDA platform quantifies risk disclosures in IPO prospectuses using an LDA topic model.
Create your own copy of the repository to experiment with RiskyData-LDA freely.
git clone https://github.com/mxhofer/RiskyData-LDA.git
pip install -r requirements.txt
NB: You might need to install geckodriver so that Bokeh can save images in PNG format:
conda install -c conda-forge firefox geckodriver
We provide the data from the Risky Data paper in the data/ipo.csv file. You can replace this file with your own risk factor text, as long as the column containing the risk text is split into paragraphs using the ---new_paragraph--- divider.
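If you prepare your own risk factor text, the following minimal sketch shows one way to build and check a cell that uses the divider. The helper names `split_paragraphs` and `join_paragraphs` are hypothetical and not part of the repository; only the divider string itself comes from this README.

```python
from __future__ import annotations

# Divider string documented in this README.
DIVIDER = "---new_paragraph---"


def split_paragraphs(risk_text: str) -> list[str]:
    """Split one firm's risk text into stripped, non-empty paragraphs."""
    parts = (part.strip() for part in risk_text.split(DIVIDER))
    return [part for part in parts if part]


def join_paragraphs(paragraphs: list[str]) -> str:
    """Join paragraphs into a single cell value, skipping empty entries."""
    cleaned = (p.strip() for p in paragraphs)
    return DIVIDER.join(p for p in cleaned if p)


# Illustrative text only, not taken from the dataset.
cell = join_paragraphs([
    "We have a limited operating history.",
    "We depend on a small number of suppliers.",
])
print(split_paragraphs(cell))
```

Round-tripping a cell through both helpers is a quick way to confirm that your own text will be seen as separate paragraphs before you run the pipeline.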
To quantify textual risk factors, run the following scripts (in order):
- 00-preprocess.py: cleans text data (tokenization, stemming, etc.)
  - Input: data/ipo.csv (download the IPO data and store it at this path; URL in 00-preprocess.py)
  - Output: data/ipo_allText.pkl
- 01-fit.py: fits an LDA topic model and writes paragraph-level topic loadings to disk
  - Input: data/ipo_allText.pkl
  - Output: data/ipo_allTopics.pkl
- 02-normalize.py: normalizes topic loadings to firm-level risks using year and industry groups
  - Input: data/ipo_allTopics.pkl
  - Output: data/ipo_risk.xlsx
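To give an intuition for the last step, here is a rough sketch of how paragraph-level loadings could be turned into group-normalized firm-level values. It is not the code in 02-normalize.py. It assumes that (a) a firm's loading on a topic is the mean of its paragraph loadings and (b) normalization is a z-score within each (year, industry) group. Both are illustrative assumptions, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def firm_level(paragraph_loadings):
    """Average paragraph loadings per firm for ONE topic (assumed aggregation).

    paragraph_loadings: iterable of (firm_id, loading) pairs.
    """
    by_firm = defaultdict(list)
    for firm, loading in paragraph_loadings:
        by_firm[firm].append(loading)
    return {firm: mean(values) for firm, values in by_firm.items()}


def zscore_within_groups(firm_values, firm_groups):
    """Standardize each firm's value within its (year, industry) group.

    firm_groups maps firm_id -> (year, industry). Groups with zero spread
    (for example a single firm) get a score of 0.0 (an assumed convention).
    """
    members = defaultdict(list)
    for firm, value in firm_values.items():
        members[firm_groups[firm]].append(value)

    scores = {}
    for firm, value in firm_values.items():
        group_values = members[firm_groups[firm]]
        mu, sigma = mean(group_values), pstdev(group_values)
        scores[firm] = (value - mu) / sigma if sigma > 0 else 0.0
    return scores


# Made-up example values for illustration only.
paragraphs = [("A", 0.25), ("A", 0.75), ("B", 0.25), ("C", 0.5)]
groups = {"A": (2020, "tech"), "B": (2020, "tech"), "C": (2021, "bio")}
scores = zscore_within_groups(firm_level(paragraphs), groups)
print(scores)
```

Normalizing within year and industry groups makes a firm's score comparable to its peers rather than to the full sample, which is the motivation suggested by the script description above.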
A note on performance: pre-processing text data and fitting the LDA topic model take time. On a MacBook Pro 2020, the pipeline takes ~2 hours to complete.
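Because the full run is long, it can be convenient to launch the three scripts in order and record how long each stage takes. The sketch below is one way to do that, assuming it is called from the repository root and that each script runs without arguments. `run_pipeline` is a hypothetical helper, not part of the repository.

```python
import subprocess
import sys
import time

# Script names as listed in this README, in the order they must run.
SCRIPTS = ["00-preprocess.py", "01-fit.py", "02-normalize.py"]


def run_pipeline(scripts=SCRIPTS, run=subprocess.run):
    """Run each stage in order and return the elapsed seconds per script.

    check=True makes subprocess.run raise on a non-zero exit code, so a
    failing stage stops the pipeline before later stages start.
    """
    timings = {}
    for script in scripts:
        start = time.perf_counter()
        run([sys.executable, script], check=True)
        timings[script] = time.perf_counter() - start
    return timings
```

Calling `run_pipeline()` returns a dictionary of timings that you can print or log. The `run` parameter exists only so the function can be exercised without launching the real scripts.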
The results in data/ipo_risk.xlsx contain the aggregate risk disclosure and the individual risk factors for each IPO firm.