/protein_binding

A script that collects various features from multiple input files and outputs a file contains user specified features.

Primary LanguageJupyter NotebookMIT LicenseMIT

protein_binding

This repository contains the implementation of "Polypharmacology Within the Full Kinome: a Machine Learning Approach".

Parser

A script that collects various features from multiple input files and outputs a file of protein, drug, protein-drug pair features as specified by the user.

In order to run the parser:

python full_parser.py --p paths/to/protein_features --m /path/to/molec_descriptors --pm paths/to/protein/molec/features --feats /path/to/molec_features_list

where:

   --p specifies the (list) of protein feature filepaths
   --m denotes molecular descriptors file
   --pm specifies the list of protein-molec feature filepaths
   --feats specifies the file containing the features to keep from the E-Dragon molecular descriptors output
   --o specifies the output file name

Note that each of the arguments are optional.

example:

parser.py --p "SampleInputs/protein/protein_features_2struc.csv" "SampleInputs/protein/protein_features_coach_avg.csv" --pm "SampleInputs/1QCF_ML_Features/docking_summary_final.csv" --m "SampleInputs/MolecularDescriptors-Dragon7/outputExclusionMolecularDescriptors.txt" --feats "SampleInputs/MolecularDescriptors-Dragon7/descriptorsListTS2.txt"

Note: this script has been verified to run with python 3.6.1


Feature Selection

To run the random forest feature selection:

python feature_selection.py -f -data -null -strat -label -out -prot -names -root

where:

    -f: path(s) to set of initial features
    -data: path to initial dataset, this should at least contain the features specified by --f if given
    -null: path to null features
    -strat: imputation strategy used to fill in the null values
    -label: optional specify target label
    -out: output path to dir
    -prot: a flag that indicates that protein names are used as labels
    -names: list of proteins to exclude from training
    -root: root path for data and feature lists
    --split: if using dataset with seperate train/test splits, specify the split"