To download, clone this repository:

```bash
git clone https://github.com/jsunn-y/DeCOIL.git
```
To run DeCOIL optimization, the relevant anaconda environment can be installed from `DeCOIL.yml`. To build this environment, run:

```bash
cd ./DeCOIL
conda env create -f DeCOIL.yml
conda activate DeCOIL
```
DeCOIL can be executed on a personal computer using the following command:

```bash
python execute_DeCOIL.py --config_file 'file_name.json'
```
The configuration file should be located in `DeCOIL/configs`, with keys specified as follows:

Key (data config) | Description |
---|---|
name | Filename of the CSV file containing sequences in the combinatorial space and their corresponding predicted values. Must contain a column called 'Combo', which specifies the mutations at each site in the combinatorial library. |
zs_name | Name of the column in the CSV file that contains the relative value of each variant. By default, lower values are more desirable. |
samples | Approximate screening size (number of sequences sampled from the library). |
sites | Number of amino acid sites targeted in the combinatorial library. |
library | Optional argument. If not provided, optimization starts from random initializations. If a string, loads the final libraries from a previous optimization with the same configuration name. If a list of strings, each string in the list should specify a mixed-base template to load. |

Key (train config) | Description |
---|---|
type | Type of optimization; only "greedy" is currently supported. |
weight_type | List containing either "simple" or "full", for step and diffuse coverage, respectively. If "full" is specified, the encoding type can also be specified; the default is onehot. |
dist_function | Only required if weight_type contains "full". Can be "hamming", "manhattan", or "euclidean". |
zs_exp | The p parameter, which tunes the desired balance between exploration and exploitation. |
sigma | |
seed | Seed for reproducibility. |
iters | Number of iterations for greedy hill climbing. |
samples | Number of initial starting points and solutions, parallelized. |
n_mix | Number of templates per library to optimize simultaneously. |
directions | Number of random changes to the library to test in each iteration of greedy hill climbing. |
num_repeats | Number of sets of sequences to sample from the library for scoring, to obtain reliable statistics. |
num_workers | Number of processors to use. |
top_fraction | Fraction of sequences in a certain percentile of predictor scores, for logging purposes only. |
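
For illustration, a minimal configuration sketch is shown below. The key names follow the tables above, but the grouping of keys into "data_config" and "train_config" sections, the file name `my_library.csv`, the column name `zs_score`, and all numeric values are assumptions made for this sketch; refer to the files under `configs/examples` for the exact schema expected by `execute_DeCOIL.py`.

```json
{
    "data_config": {
        "name": "my_library.csv",
        "zs_name": "zs_score",
        "samples": 384,
        "sites": 4
    },
    "train_config": {
        "type": "greedy",
        "weight_type": ["simple"],
        "zs_exp": 2,
        "seed": 0,
        "iters": 300,
        "samples": 96,
        "n_mix": 1,
        "directions": 10,
        "num_repeats": 10,
        "num_workers": 8,
        "top_fraction": 0.01
    }
}
```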
After each iteration of optimization, average stats for each library will be printed:
* Weighted: weighted coverage
* Unweighted: unweighted coverage
* Raw Weighted Simple: weighted step coverage with p=1
* Top Counts: number of unique variants sampled by the library that lie in "top_fraction"
* Unique: number of unique variants sampled by the library that are not stop codons

Outputs will be saved in `DeCOIL/saved` under a folder with the same name as the configuration file. The optimization trajectory can be accessed in `results.npy`, and example analyses are provided in `analysis.ipynb`.

To run MLDE simulations, the relevant anaconda environment can be installed from `MLDE_lite.yml`. To build this environment, run:

```bash
cd ./DeCOIL
conda env create -f MLDE_lite.yml
conda activate MLDE_lite
```
The MLDE simulations demonstrated in our accompanying study can be reproduced with the following command:

```bash
python execute_mlde.py --config_file 'file_name.json'
```
The configuration file should be located in `MLDE_lite/configs`, with keys specified as follows:

Key (data config) | Description |
---|---|
name | Filename of the CSV file containing sequences in the combinatorial space and corresponding fitness values. Use "GB1_fitness.csv" for the GB1 dataset. |
encoding | List of encodings to use; only "one-hot" is supported. |
library | Specifies how the training sequences should be sampled for each simulation. If a list of integers is provided, for each integer the corresponding saved optimization output (under `optimization/saved`) will be used. |
n_solutions | Must correspond to the number of libraries specified in "library". |

Key (model config) | Description |
---|---|
name | List of model classes to be used; only "boosting" is supported. |

Key (train config) | Description |
---|---|
seed | Seed for reproducibility. |
n_samples | List of integers specifying the number of sequences to sample from a library for the training set. |
n_splits | Number of models to be trained in the ensemble. |
n_subsets | Number of times to sample training sequences from the library, to get statistics. |
num_workers | Number of processors to use. |
verbose | Boolean specifying whether or not to print training updates. |
save_model | Boolean specifying whether or not to save the trained models. |
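
As above, an illustrative MLDE configuration sketch is shown below. The grouping of keys into "data_config", "model_config", and "train_config" sections and all specific values are assumptions made for this sketch; the example configurations under `configs/examples` show the exact schema expected by `execute_mlde.py`.

```json
{
    "data_config": {
        "name": "GB1_fitness.csv",
        "encoding": ["one-hot"],
        "library": [0, 1, 2],
        "n_solutions": 3
    },
    "model_config": {
        "name": ["boosting"]
    },
    "train_config": {
        "seed": 0,
        "n_samples": [96, 384],
        "n_splits": 5,
        "n_subsets": 10,
        "num_workers": 8,
        "verbose": false,
        "save_model": false
    }
}
```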
Outputs will be saved in `MLDE_lite/saved` under a folder with the same name as the configuration file. The results of MLDE simulations can be accessed in `mlde_results.npy`, and example analyses are provided in `analysis.ipynb`.

Relevant examples can be found under `/configs/examples` and `/saved/examples`.