This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Phenotype Prediction Pipeline.
This pipeline **predicts** the relative importance of a set of genes associated with a given phenotype.
This pipeline supports two regression models:
Options | Method | Parameters |
---|---|---|
Elastic Net | Elastic | elastic_net |
Lasso | Lasso | Lasso |
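
For orientation, the two models differ only in the penalty they add to the least-squares loss. The sketch below writes out those penalty terms as scikit-learn documents them for its `ElasticNet` and `Lasso` estimators. The function names are illustrative and are not part of this pipeline's code.

```python
def lasso_penalty(weights, alpha):
    """L1 penalty: alpha * sum(|w|)."""
    return alpha * sum(abs(w) for w in weights)


def elastic_net_penalty(weights, alpha, l1_ratio):
    """Mix of L1 and squared-L2 penalties, weighted by l1_ratio in [0, 1]."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


if __name__ == "__main__":
    w = [0.5, -1.0, 0.0]  # made-up coefficient vector
    print(lasso_penalty(w, alpha=0.1))
    print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5))
```

With `l1_ratio=1.0` the Elastic Net penalty reduces to the Lasso penalty, which is why the same parameter table can serve both methods.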
git clone https://github.com/KnowEnG/Phenotype_Prediction_Pipeline.git
apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.18.0
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage
cd Phenotype_Prediction_Pipeline
cd test
make env_setup
- Run Elastic Net pipeline
make run_elastic_net
- Run Lasso pipeline
make run_lasso
Follow steps 1-3 above then do the following:
mkdir results_directory
Look for example run_parameters files in the `Phenotype_Prediction_Pipeline/data/run_files` directory:
- BENCHMARK_2_ElasticNet.yml
- BENCHMARK_1_Lasso.yml
Using Elastic Net
python3 ../src/phenotype_prediction.py -run_directory ./run_dir -run_file BENCHMARK_2_ElasticNet.yml
Using Lasso
python3 ../src/phenotype_prediction.py -run_directory ./run_dir -run_file BENCHMARK_1_Lasso.yml
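
If you want to launch both benchmark runs from a script, the two commands above can be wrapped as follows. This is a minimal sketch: `build_command` and `run_pipeline` are hypothetical helper names, and the script path, flags, and run file names are copied from the commands above. It assumes you call it from the same directory you would run those commands in.

```python
import subprocess

SCRIPT = "../src/phenotype_prediction.py"  # path used in the commands above


def build_command(run_file, run_directory="./run_dir"):
    """Return the argument list for one pipeline run."""
    return ["python3", SCRIPT, "-run_directory", run_directory, "-run_file", run_file]


def run_pipeline(run_file, run_directory="./run_dir"):
    """Run one benchmark; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(run_file, run_directory), check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_pipeline(...) to execute them.
    for run_file in ("BENCHMARK_2_ElasticNet.yml", "BENCHMARK_1_Lasso.yml"):
        print(" ".join(build_command(run_file)))
```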
Key | Value | Comments |
---|---|---|
Method | elastic_net_predict | scikit-learn.org elastic-net |
Method | lasso_predict | scikit-learn.org lasso |
results_directory | directory | Directory to save the output files |
spreadsheet_name_full_path | spreadsheet_name | Input Gene Expression data |
response_name_full_path | response_name | Input Drug Response data |
test_spreadsheet_name_full_path | test_spreadsheet_name | Input testing feature data |
min_alpha | float number | Minimum number in alpha list |
max_alpha | float number | Maximum number in alpha list |
tolerance | float number | The tolerance for the optimization |
fit_intercept | boolean value | Whether to calculate the intercept for this model |
normalize | boolean value | Whether the regressors will be normalized |
max_iter | integer number | The maximum number of iterations |
n_alpha | integer number | Number of alphas in alpha list |
min_l1 | float number | Minimum l1 in the grid of l1 |
max_l1 | float number | Maximum l1 in the grid of l1 |
n_l1 | integer number | Length of grid of l1 |
eps | float number | Length of the path |
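
To illustrate how the `min_*`, `max_*`, and `n_*` keys describe a search grid, the sketch below expands them into explicit lists. This README does not say whether the pipeline spaces the values linearly or logarithmically, so linear spacing is an assumption here. `linear_grid` is a hypothetical helper, and the parameter values are made up rather than taken from the benchmark run files.

```python
def linear_grid(minimum, maximum, count):
    """Return `count` evenly spaced values from minimum to maximum inclusive."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [float(minimum)]
    step = (maximum - minimum) / (count - 1)
    return [minimum + i * step for i in range(count)]


# Hypothetical values, not taken from the benchmark run files.
run_parameters = {
    "min_alpha": 0.5, "max_alpha": 2.0, "n_alpha": 4,
    "min_l1": 0.25, "max_l1": 1.0, "n_l1": 4,
}

alphas = linear_grid(
    run_parameters["min_alpha"], run_parameters["max_alpha"], run_parameters["n_alpha"]
)
l1_grid = linear_grid(
    run_parameters["min_l1"], run_parameters["max_l1"], run_parameters["n_l1"]
)

if __name__ == "__main__":
    print("alphas:", alphas)
    print("l1 grid:", l1_grid)
```

The l1 grid applies only to Elastic Net; for Lasso the mixing ratio is fixed and only the alpha list is searched.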
spreadsheet_name = features_train_clean.df
response_name = response_train_clean.df
test_spreadsheet_name = features_test_clean.df
Gene Name | Relative Importance |
---|---|
User Gene 1 | Float |
... | ... |
User Gene n | Float |
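
Once a run finishes, you may want the genes ordered by score. The sketch below assumes the output is a tab-separated file with a header row and the two columns shown above. The delimiter, the header row, the sample content, and the `rank_genes` helper are all assumptions, so adjust them to match the file the pipeline actually writes. It sorts by absolute value in case scores can be negative.

```python
import csv
import io

# Hypothetical output content; real gene names and scores will differ.
EXAMPLE_OUTPUT = (
    "Gene Name\tRelative Importance\n"
    "GENE_A\t0.12\n"
    "GENE_B\t-0.80\n"
    "GENE_C\t0.00\n"
)


def rank_genes(handle):
    """Return (gene, importance) pairs sorted by absolute importance, largest first."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    rows = [(gene, float(score)) for gene, score in reader]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


if __name__ == "__main__":
    for gene, score in rank_genes(io.StringIO(EXAMPLE_OUTPUT)):
        print(gene, score)
```

To read a real result, pass an open file handle instead of the `io.StringIO` wrapper.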