/ADpred_publication

Scripts and notebook for the publication of ADpred

Primary language: Jupyter Notebook. License: MIT.

Bioinformatic analysis for the manuscript:

"A high-throughput screen for transcription activation domains reveals their sequence characteristics and permits reliable prediction by deep learning"

Installation

  1. Install git (unless you already have it).

  2. Clone the repository to your computer (git clone git@github.com:aerijman/ADpred_publication.git && cd ADpred_publication).

  3. Build dependencies.

    With conda:

    • Install conda.
    • In a terminal window, copy-paste:
      • conda create -n adpred python=3.7.4
      • conda activate adpred
      • while read requirement; do conda install --yes $requirement; done < dependencies.txt

    With pip:

    • Install pip.
    • Run pip install -r requirements.txt (you should have python>=3.6.5).
  4. Install DeepExplain: $(which pip) install -e git+https://github.com/marcoancona/DeepExplain.git#egg=deepexplain

  5. Run the notebook: jupyter notebook analysis.ipynb (I prefer jupyter lab).
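
Before installing with pip, it can help to confirm that the active interpreter meets the stated minimum (python>=3.6.5). The snippet below is only a small sanity-check sketch; the helper name python_is_recent_enough is hypothetical and is not part of this repository.

```python
import sys

# Minimum version stated for the pip route in this README.
MIN_PYTHON = (3, 6, 5)


def python_is_recent_enough(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, micro) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= tuple(minimum)


# Report on the interpreter that is running this snippet.
current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_recent_enough():
    print("Python " + current + " satisfies the stated minimum.")
else:
    print("Python " + current + " is older than 3.6.5; consider the conda route.")
```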


Analysis

  • The first cell of the Jupyter notebook downloads the data from its Dropbox address.
    Alternatively, you can download the data outside the notebook (wget https://www.dropbox.com/s/vooe7mb62rnswp5/data2.tar.gz?dl=0) and start the notebook from cell 2.

  • Figures are created by running the scripts from the notebook. Many of the scripts are very slow, so some of the processes included in the notebook were modified and executed on a high-performance cluster at the Fred Hutchinson Cancer Research Center (in some cases with GPUs). All scripts adapted for HPC can be provided upon request to aerijman@fredhutch.org or aerijman@neb.com.
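
For those who prefer to fetch the data from Python rather than with wget, the download-and-unpack step could look roughly like the sketch below. It is not the code from the notebook's first cell: the function name fetch_and_extract and the local archive name are hypothetical, and whether the Dropbox link returns the archive itself (rather than a preview page) to a Python client is an assumption that has not been verified here.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in this README. Assumption: it serves the .tar.gz directly.
DATA_URL = "https://www.dropbox.com/s/vooe7mb62rnswp5/data2.tar.gz?dl=0"


def fetch_and_extract(url, dest_dir=".", archive_name="data2.tar.gz"):
    """Download `url` into `dest_dir` (unless already present) and unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name

    # Skip the download if a previous run already saved the archive.
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))

    # Fail early if the server returned something other than a tar archive
    # (for example an HTML page).
    if not tarfile.is_tarfile(str(archive)):
        raise ValueError(str(archive) + " is not a tar archive; check the URL")

    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(dest))
    return archive


# Example call (commented out so that importing this sketch does not hit the network):
# fetch_and_extract(DATA_URL)
```

After a successful run, the notebook can be started from cell 2, as with the wget route.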



Preprocessing (already done):

Preprocessing is very time consuming and consisted of: 1- pairing the two reads in the fastq files; 2- filtering for reads with the correct number of bases and without artifacts (e.g. internal stop codons); 3- translating into amino acids; 4- clustering similar sequences (some sequences differ by 1 or 2 amino acids from their parental sequence due to errors during library preparation and sequencing; this step alleviates that sequence divergence); 5- predicting secondary structure and disorder from the amino-acid sequences.
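
To make the filtering and translation steps concrete, here is a minimal, self-contained sketch. It is not the repository's translate.py: the function names are hypothetical, the expected insert length is left as a parameter because it is not stated here, and the sketch treats any stop codon inside the insert as an artifact.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def keep_read(dna, expected_length):
    """Keep a read only if it has the expected length, contains only A/C/G/T,
    and translates without any stop codon."""
    if len(dna) != expected_length or expected_length % 3 != 0:
        return False
    if set(dna.upper()) - set("ACGT"):
        return False
    return "*" not in translate(dna)


# Toy reads with a hypothetical expected length of 6 bases.
reads = ["ATGGCC", "ATGTAA", "ATGGC"]
proteins = [translate(r) for r in reads if keep_read(r, expected_length=6)]
print(proteins)
```

In this toy input, the second read is dropped because of its stop codon and the third because of its length.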

  1. Constructing the complete insert from the reads with FLASH_wrapper.py, which wraps the FLASH tool.
  2. Translating the nucleotide sequences into protein sequences with translate.py.
  3. Clustering similar sequences to reduce noise/variation from sequencing and DNA handling errors with run_usearch.sh, which wraps the usearch tool, and clusters.py.
  4. Secondary structure and disorder predictions were automated with external_software.sh, which uses psipred and iupred.
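
The idea behind the clustering step (folding rare variants that differ by 1 or 2 amino acids into their more abundant parental sequence) can be illustrated with a toy greedy clustering on Hamming distance. This is a deliberately simplified stand-in, not the usearch algorithm and not the logic of clusters.py; the function names and the max_mismatches parameter are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(sequences, max_mismatches=2):
    """Assign each distinct sequence to the first (most abundant) centroid
    within `max_mismatches`; otherwise make it a new centroid.

    Returns (centroid -> total read count, sequence -> centroid).
    """
    counts = Counter(sequences)
    centroids = {}
    assignment = {}
    # Visit sequences from most to least abundant, so that abundant sequences
    # become centroids and rare variants are folded into them.
    for seq, n in counts.most_common():
        for centroid in centroids:
            if len(centroid) == len(seq) and hamming(centroid, seq) <= max_mismatches:
                centroids[centroid] += n
                assignment[seq] = centroid
                break
        else:
            centroids[seq] = n
            assignment[seq] = seq
    return centroids, assignment


# Toy example: one abundant sequence, one single-residue variant, one unrelated sequence.
toy = ["MADE"] * 5 + ["MADQ"] + ["WWWW"] * 2
centroids, assignment = greedy_cluster(toy)
print(centroids)
```

A dedicated tool such as usearch handles indels, identity thresholds, and large inputs far more efficiently; the sketch only shows why clustering reduces the apparent sequence diversity.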

ToDo LiSt: