The purpose of this small library is to apply feature selection to your high-dimensional data. In order to do that, we apply the following steps:
- the input features are automatically standardized
- a column containing gaussian noise (average = 0, standard deviation = 1) is added
- a model is trained
- the feature importance is evaluated
- all the features that are less important than the random one, are excluded The previous steps are repeated iteratively, until the algorithm converges.
Let us start by cloning the repository, by using the following command:
git@github.com:apalladi/feature-selection-adding-noise.git
Then you need to install the dependencies. I suggest to create a virtual environment, as follows:
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
To check if everything works, you can run the unit tests:
python -m pytest tests/test.py
You are now ready to use the repository!
The example you need is contained in this notebook.
A toy dataset, to build a regression model, is imported.
Then we import the function get_relevant_features
from src.ml import get_relevant_features
This function takes as arguments:
features
labels
model
, a scikit-learn modelepochs
, the number of epochs (i.e. for how many cycles you want to apply recursively the feature selection)patience
, number of epochs without any improvement of the features selection, before stopping the process (the idea is similar to the early stopping of Tensorflow/Keras)splitting_type
, it can be equal tosimple
(for simple train/test split) orkfold
(for 5-fold splitting). If you choosekfold
, the feature importance will be computed as the average feature importance for each train/test subset.noise_type
, it can be equal togaussian
for gaussian noise orrandom
for flat random noisefilename_output
, a string to indicate where to save the file. You can also chooseNone
if you do not want to save itrandom_state
, set the random seed that it is used by the k-fold splitting
The function get_relevant_features
returns a DataFrame with a reduced dataset, i.e. a dataset that contains only the most important features.