/Phenotype_Prediction_Pipeline

Regression to infer an 'omic'-phenotype association

Primary LanguagePythonOtherNOASSERTION

KnowEnG's Phenotype Prediction Pipeline

This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Phenotype Prediction Pipeline.

This pipeline **predicts" the relative importance of a set of genes associated with a given phenotype.

This pipeline supports two regression models:

Options Method Parameters
Elastic Net Elastic elastic_net
Lasso Lasso Lasso

How to run this pipeline with our data


1. Clone the Phenotype_Prediction_Pipeline

 git clone https://github.com/KnowEnG/Phenotype_Prediction_Pipeline.git

2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.18.0
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib=1.4.2
pip3 install pyyaml
pip3 install knpackage

3. Change directory to Phenotype_Prediction_Pipeline

cd Phenotype_Prediction_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Select and run a phenotype prediction option:

  • Run Elastic Net pipeline
make run_elastic_net
  • Run Lasso pipeline
make run_lasso

Follow steps 1-3 above then do the following:

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters in:

Phenotype_Prediction_Pipeline/data/run_files BENCHMARK_2_ElasticNet.yml

Phenotype_Prediction_Pipeline/data/run_files BENCHMARK_1_Lasso.yml

* Run

Using Elastic net

python3 ../src/phenotype_prediction.py -run_directory ./run_dir -run_file BENCHMARK_2_ElasticNet.yml

Using Lasso

python3 ../src/phenotype_prediction.py -run_directory ./run_dir -run_file BENCHMARK_1_Lasso.yml

Description of "run_parameters" file


Key Value Comments
Method elastic_net_predict scikit-learn.org elastic-net
Method lasso_predict scikit-learn.org lasso
results_directory directory Directory to save the output files
spreadsheet_name_full_path spreadsheet_name Input Gene Expression data
response_name_full_path response_name Input Drug Response data
test_spreadsheet_name_full_path test_spreadsheet_name Input testing feature data
min_alpha float number Minimum number in alpha list
max_alpha float number Maximum number in alpha list
tolerance float number The tolerance for the optimization
fit_intercept boolean value whether to calculate the intercept for this model
normalize boolean value whether the regressors will be normalized
max_iter integer number The maximum number of iterations
n_alpha integer number Number of alphas in alpha list
min_l1 float number Minimum l1 in the grid of l1
max_l1 float number Maximum l1 in the grid of l1
n_l1 integer number Length of grid of l1
eps float number Length of the path

spreadsheet_name = features_train_clean.df
response_name = response_train_clean.df
test_spreadsheet_name = features_test_clean.df


Description of Output files are saved in results directory


Gene Name Relarive Importance
User Gene 1 Float
... ...
User Gene n Float