Charles Cazals: charles.cazals@hec.edu
Jean Chillet: jean.chillet@hec.edu
Antoine Demeire: antoine.demeire@hec.edu
Katrin Dimitrova: katrin.dimitrova@hec.edu
Alexandre Leboucher: alexandre.leboucher@hec.edu
Suppose we are missing a value at timestamp t for time series i:
- We look at the growth rate between times t-1 and t for all the available time series.
- To weight the relevance of each obtained growth rate, we use that series' overall correlation with the original series i.
- We then infer the growth rate of series i at time t as the correlation-weighted average:

  `g_i(t) = sum_j(rho_ij * g_j(t)) / sum_j(rho_ij)`

  where `rho_ij` is the correlation of returns (not of absolute values) of series i and j across the whole period.
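The scheme above can be sketched in a few lines of pandas. This is a minimal illustration, not the repo's actual implementation; the function name and toy data are ours:

```python
import numpy as np
import pandas as pd

def impute_correlation_weighted(df: pd.DataFrame, series: str, t) -> float:
    """Infer the missing value of `series` at timestamp `t` from the
    correlation-weighted growth rates of the other series."""
    # Growth rates of every series; fill_method=None keeps gaps as NaN
    returns = df.pct_change(fill_method=None)
    # Correlation of returns of series i with every other series
    rho = returns.corr()[series].drop(series)
    # The other series' growth rates at timestamp t
    g_t = returns.drop(columns=series).loc[t]
    mask = rho.notna() & g_t.notna()
    # Correlation-weighted average of the other series' growth rates
    g_i = (rho[mask] * g_t[mask]).sum() / rho[mask].sum()
    # Last observed value of series i before t, grown by the inferred rate
    prev = df[series].loc[:t].dropna().iloc[-1]
    return prev * (1 + g_i)
```

For instance, if the other series are perfectly correlated with series i and all grow by 50% at t, the imputed value is the last observed value of i grown by 50%.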
Finally, instead of using the raw correlations, we can pre-process them before using them as weights. Here are a few examples:
To understand how to use this repo, we advise looking at our presentation beforehand.
The repository is divided into 5 main folders:
- `./utils/`:
  This folder contains all the utilities required to use this project.
  - The script `setup.py` loads the initial data set and creates a symbolic link to the credentials needed to access blob storage. To run this script outside of `main.py`, type in the command line:

    ```bash
    python3 setup.py
    ```
  - The script `preprocessing.py` loads the initial data set and saves two preprocessed datasets to `/data/preprocessing`:
    - `df_full`: data frame imputed using the baseline (linear interpolation)
    - `df_miss`: data frame with values missing at random

    To run this script, simply type in the command line:

    ```bash
    python3 preprocessing.py <DATA_IN> <ACTION>
    ```

    where `<DATA_IN>` is the initial data set to be processed.
- `./notebooks/`:
  This folder contains examples of how to apply the different functions in a Python notebook environment.
  - `correlations.ipynb` shows how to impute the dataset using the correlations-based model.
  - `evaluation.ipynb` shows how to impute the dataset using one of the two methods and how to evaluate it relative to the baseline.
- `./data/`:
  This folder isn't pushed to this repo as it is too heavy. It contains sub-folders with the initial and processed data sets.
- `./results/`:
  Running the algorithm will write a CSV with the imputed missing values to this folder.
- `./img/`:
  This folder contains image resources.
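For reference, the baseline used to build `df_full` is linear interpolation, which pandas provides out of the box (the toy data below is ours, standing in for the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset with values missing at random
df_miss = pd.DataFrame({"price": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Baseline imputation: fill each gap on a straight line between its neighbours
df_full = df_miss.interpolate(method="linear")
```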
Please run in the command line:

```bash
python3 main.py <ACTION> <MODEL>
```
where:
- `<ACTION>` denotes the task you want to perform:
  - `impute` will use the selected model to impute the initial dataset.
  - `evaluate` will evaluate the imputation model against a reference dataset.
- `<MODEL>` denotes the model you want to use (optional):
  - `xgboost` will use a regression-based model to impute each series individually (based on relevant data).
  - `correlations` will use a correlation-based model to impute series based on correlated assets.
This will load the data (assuming you have the required credentials locally), process it, and, depending on the `<ACTION>`, output either an imputed dataset or evaluation results (a pickled results dictionary and RMSE boxplots) to `./results/`.
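To give a feel for the regression-based model, here is the underlying idea on synthetic data, with plain least squares standing in for XGBoost (the repo itself uses `xgboost`; this sketch keeps the example dependency-free):

```python
import numpy as np

# Toy data: three correlated series, with one missing value in series 0 at t = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)).cumsum(axis=0)
X[:, 0] = 0.7 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.1, size=50)
true_val = X[3, 0]
X[3, 0] = np.nan

# Fit series 0 on the other series over the fully observed rows
mask = ~np.isnan(X[:, 0])
A = np.column_stack([X[mask][:, 1:], np.ones(mask.sum())])  # features + intercept
coef, *_ = np.linalg.lstsq(A, X[mask, 0], rcond=None)

# Predict the missing entry from the other series' values at t = 3
pred = np.array([X[3, 1], X[3, 2], 1.0]) @ coef
```

XGBoost plays the same role as `lstsq` here, but can capture non-linear relationships between the target series and its predictors.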