"A high-throughput screen for transcription activation domains reveals their sequence characteristics and permits reliable prediction by deep learning"
- Install git (unless you already have it).
- Clone the repository to your computer:
  git clone git@github.com:aerijman/ADpred_publication.git && cd ADpred_publication
- Build dependencies.
  With conda:
  - Install conda.
  - In a terminal window, copy-paste:
    conda create -n adpred python=3.7.4
    conda activate adpred
    while read requirement; do conda install --yes $requirement; done < dependencies.txt
  With pip:
  - Install pip.
  - Run
    pip install -r requirements.txt
    (you should have python>=3.6.5).
- Install DeepExplain:
  $(which pip) install -e git+https://github.com/marcoancona/DeepExplain.git#egg=deepexplain
- Run the notebook:
  jupyter notebook analysis.ipynb
  (I prefer jupyter lab.)
- The first cell of the Jupyter notebook downloads the data from its Dropbox address. Alternatively, you can download the data outside the notebook with
  wget https://www.dropbox.com/s/vooe7mb62rnswp5/data2.tar.gz?dl=0
  and start the notebook from cell 2.
- Figures are created by running the scripts from the notebook. Many of the scripts are very slow. Hence, some of the processes included in the notebook were modified and executed on a high-performance cluster at the Fred Hutchinson Cancer Research Center (in some cases with GPUs). All scripts adapted for HPC can be provided upon request to aerijman@fredhutch.org or aerijman@neb.com.
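The manual download mentioned above can also be done from Python. The following is only a sketch, not the code in the notebook's first cell: the function names `direct_download_url` and `fetch_and_unpack` are hypothetical, the local file name `data2.tar.gz` is taken from the URL, and the switch to `dl=1` assumes the usual Dropbox behaviour where `dl=0` serves a preview page and `dl=1` serves the file itself.

```python
import tarfile
import urllib.request
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL copied from the README; everything else below is illustrative.
DATA_URL = "https://www.dropbox.com/s/vooe7mb62rnswp5/data2.tar.gz?dl=0"


def direct_download_url(url):
    """Return the same URL with dl=1 (assumed to request the raw file)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_and_unpack(url=DATA_URL, archive="data2.tar.gz", dest="."):
    """Download the archive unless it already exists, then extract it."""
    archive = Path(archive)
    if not archive.exists():
        urllib.request.urlretrieve(direct_download_url(url), str(archive))
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(path=str(dest))
    return Path(dest)
```

Calling `fetch_and_unpack()` from the repository root would place the extracted files next to the notebook, after which the notebook can be started from cell 2. If the archive was already fetched with wget, the download is skipped and only the extraction runs.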
Preprocessing is very time consuming and consists of: 1- pairing the two reads in the fastq files; 2- filtering for reads with the correct number of bases and without artifacts (e.g. internal stop codons); 3- translation into amino acids; 4- clustering similar sequences (some sequences differ by 1 or 2 amino acids from their parental sequence due to errors during library preparation and sequencing; this step alleviates that divergence); 5- predicting secondary structure and disorder from the amino-acid sequences.
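Steps 2 and 3 (filtering and translation) can be illustrated with a minimal sketch. This is not the repository's `translate.py`; the function names and the `expected_length` parameter are hypothetical, and the sketch assumes the standard genetic code and that a stop codon is tolerated only as the last codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a nucleotide string codon by codon ('X' for unknown codons)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def filter_and_translate(dna, expected_length):
    """Return the protein for a valid read, or None if the read is rejected."""
    dna = dna.upper()
    if len(dna) != expected_length:  # wrong number of bases
        return None
    if set(dna) - set("ACGT"):  # ambiguous or non-nucleotide characters
        return None
    protein = translate(dna)
    if "*" in protein[:-1]:  # internal stop codon
        return None
    return protein
```

For example, `filter_and_translate("ATGGCCAAA", 9)` returns `"MAK"`, while `filter_and_translate("ATGTAAGCC", 9)` returns `None` because of the internal stop codon.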
- Constructing the complete insert from the reads with FLASH_wrapper.py, which wraps the FLASH tool.
- Translating the nucleotide sequences into protein sequences with translate.py.
- Clustering similar sequences to reduce noise/variation from sequencing and DNA handling errors with run_usearch.sh, which wraps the usearch tool, and clusters.py.
- Secondary structure and disorder predictions were automated with external_software.sh, which uses psipred and iupred.
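To give an idea of what the clustering step achieves, here is a toy stand-in. It is not what run_usearch.sh or clusters.py do (usearch uses its own algorithm); it is a simple greedy scheme, assumed here for illustration only, in which each sequence joins the most abundant existing centroid of the same length that differs in at most `max_mismatches` positions. The names `hamming`, `greedy_cluster` and `max_mismatches` are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(sequences, max_mismatches=2):
    """Map each centroid sequence to the total read count of its cluster."""
    clusters = {}
    # Visit sequences from most to least abundant, so that abundant
    # (presumably error-free) sequences become centroids first.
    for seq, n in Counter(sequences).most_common():
        for centroid in clusters:
            if len(seq) == len(centroid) and hamming(seq, centroid) <= max_mismatches:
                clusters[centroid] += n
                break
        else:
            clusters[seq] = n  # no close centroid found: start a new cluster
    return clusters
```

With the reads `["MAKDE"] * 5 + ["MAKDQ"] + ["WWWWW"] * 2`, the single `MAKDQ` read is absorbed into the `MAKDE` cluster, giving `{"MAKDE": 6, "WWWWW": 2}`.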
ToDo LiSt: