README

Open Data Analysis - Bidirectional Best Hit (BBH) analysis

Welcome! This repository contains a BLAST sequence analysis. To add information about functional conservation, we perform a motif analysis. In particular, we focus on the functional conservation of transcription factors (TFs), which are proteins that bind to DNA sequences and activate or repress the transcription of the genes they regulate. We expect that, if the orthologous TFs identified by BBH are functionally conserved, the sequences aligned in the BBH will be enriched with DNA-binding motifs.

Project structure

.
├── bin
│   └── R
│       ├── fun
│       └── scripts
├── data
│   ├── processed
│   └── raw
├── docs
└── figs

9 directories

The repository contains the following folders:

  • bin: Contains the code files (e.g., .R, .sh, .py, among others). For each programming language, two sub-folders exist: fun, where all the functions are placed, and scripts, where the main scripts are placed.

  • data: Contains flat files (.tsv, .tab, .csv, among others). It has two sub-folders: raw, where the original datasets are placed, and processed, where all the modified files are placed.

  • figs: Contains all the figures, either generated by the scripts or needed to compile the .Rmd file for the report.

  • docs: .Rmd and .html files from the project report.

Pipeline Overview

The workflow consists of three main parts (Figure 1), each associated with an R script.

Figure 1. Workflow. First, we cleaned the data from different sources and merged them into a single table. Then, we extracted the sequences from both the BBH output and the motifs. Finally, we asked whether the DNA-binding motifs aligned with the BBH output.

Used datasets

  1. ecoliAnnotation.tsv:

A tab-separated file containing 9 columns. All columns are of type character.

  • Locus_tag : Unique identifier for a gene based on its genomic coordinates. It unifies the “common” names of the gene across different databases.
  • NCBI_name : Name of the E. coli gene according to the NCBI genome database.
  • Regulondb_name : Name of the E. coli gene according to the RegulonDB database.
  • Abasy_name : Name of the E. coli gene according to the Abasy Atlas database.
  • Ecocyc_name : Name of the E. coli gene according to the EcoCyc database.
  • Synonyms : Historical record of the names of the E. coli gene.

ProteinID    Locus_tag  NCBI_name  Regulondb_name  Abasy_name  Ecocyc_name  Synonyms  RegulondbID   EC_product
NP_414542.1  b0001      thrL       thrL            thrL        thrL                   ECK120001251  thr operon leader peptide

Note: The file was generated from the intersection of the following databases: EcoCyc (Karp et al. 2018), RegulonDB (Santos-Zavaleta et al. 2019), Abasy Atlas (Escorcia-Rodríguez, Tauch, and Freyre-González 2020) and NCBI genome (Kuznetsov and Bollin 2021).

  2. geneAASeq.tsv:

A tab-separated file containing 2 columns. Both columns are of type character.

  • ProteinID : Protein identifier.
  • Sequence : Amino acid sequence of the protein.

ProteinID    Sequence
NP_414542.1  MKRISTTITTTITITTGNGAG

Note: This file was retrieved from NCBI genome (Kuznetsov and Bollin 2021).

  3. MotifsSeqRelation.tsv:

A tab-separated file containing 5 columns. The columns are of type character and integer.

  • TF_name : Transcription Factor “common” name
  • Locus_tag : Locus tag of the gene
  • Motif_description : Type of motif: Ca-Binding-Region, Conserved-Region, Catalytic Domain, DNA-Binding-Region, Intramembrane-Region, Nucleotide-Phosphate-Binding-Region, Protein-Structure-Region, Alpha-Helix-Region, Beta-Strand-Region, Coiled-Coil-Region, Transmembrane-Region, and Zn-Finger-Region.
  • mSS : Motif sequence start.
  • mSE : Motif sequence end.

TF_name  Locus_tag  Motif_description   mSS  mSE
aaeR     b3243      DNA-Binding-Region  19   38

This file contains the relationships between genes and their annotated motifs (both the description and the coordinates in the protein) from the E. coli K-12 genome. This file was retrieved from EcoCyc (Karp et al. 2018).
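To illustrate how the mSS and mSE coordinates could be used, the following Python sketch slices a motif out of a protein sequence. It assumes that the coordinates are 1-based and inclusive, which this README does not state explicitly. The function name extract_motif is hypothetical, and the repository's own scripts are written in R.

```python
def extract_motif(sequence: str, start: int, end: int) -> str:
    """Return the residues from `start` to `end` (assumed 1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(
            f"coordinates {start}-{end} fall outside a sequence of length {len(sequence)}"
        )
    # Python slices are 0-based and end-exclusive, hence the start - 1.
    return sequence[start - 1:end]


# Sequence taken from the geneAASeq.tsv sample row (NP_414542.1).
thrL = "MKRISTTITTTITITTGNGAG"

# Illustrative coordinates only; they are NOT an annotated motif of thrL.
print(extract_motif(thrL, 3, 7))
```

Raising an error on out-of-range coordinates, rather than silently truncating, makes mismatches between the motif table and the sequence table visible early.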

  4. Orthologous_ECaaq_RZaadb_blastN_b1_m8.tab:

A tab-separated file containing 14 columns. The columns are of type character, integer and double.

  • qName : Query sequence identifier.
  • sName : Subject sequence identifier.
  • peri : Percent identity of the alignment.
  • alilen : Number of amino acids aligned.
  • numMM : Number of mismatches in the alignment.
  • nnGP : Number of gaps in the alignment.
  • qSS : Query sequence start.
  • qSE : Query sequence end.
  • sSS : Subject sequence start.
  • sSE : Subject sequence end.
  • Evalue : E-value of the alignment.
  • bitScore : Bit score of the alignment.
  • qlen : Length of the query sequence.
  • coveragePercent : Coverage percent of the alignment.

qName                       sName            peri   alilen  numMM  nnGP  qSS  qSE  sSS  sSE  Evalue  bitScore  qlen  coveragePercent
gnl|ECaadb|100|NP_414651.1  gnl|RZaadb|1673  42.28  272     150    4     27   296  15   281  2e-50   166       297   90.5723905723906

This file contains the results of a BBH performed with BLASTp (Altschul et al. 1990).
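The sample row of this file is numerically consistent with coveragePercent = (qSE - qSS) / qlen × 100. The Python sketch below checks that, and also pulls a protein identifier out of qName. Both steps rest on assumptions: the formula is inferred from the single row shown in this README, and the ProteinID is assumed to be the last “|”-delimited field of qName. The function names are hypothetical.

```python
COLUMNS = ["qName", "sName", "peri", "alilen", "numMM", "nnGP", "qSS", "qSE",
           "sSS", "sSE", "Evalue", "bitScore", "qlen", "coveragePercent"]


def protein_id_from_qname(qname: str) -> str:
    """Assumption: the ProteinID is the last '|'-delimited field of qName."""
    return qname.rsplit("|", 1)[-1]


def coverage_percent(q_start: int, q_end: int, q_len: int) -> float:
    """Assumed formula, inferred from the sample row of the BLAST table."""
    return (q_end - q_start) / q_len * 100


# Sample row from the README; the real file is tab-separated, but this row
# has no embedded spaces, so splitting on whitespace is enough here.
sample = ("gnl|ECaadb|100|NP_414651.1 gnl|RZaadb|1673 42.28 272 150 4 "
          "27 296 15 281 2e-50 166 297 90.5723905723906")
hit = dict(zip(COLUMNS, sample.split()))

computed = coverage_percent(int(hit["qSS"]), int(hit["qSE"]), int(hit["qlen"]))
print(protein_id_from_qname(hit["qName"]))
print(computed, float(hit["coveragePercent"]))
```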

  5. TFs_coli.txt:

A text file containing 1 column with the Transcription Factors’ names. The data type of the column is character.

TF_name
accB

Scripts overview

The workflow has three scripts implemented in R, each of which accomplishes one step of the workflow.

01_mergeData.R

  • Objective: Join all the raw data into a single table.

  • Input:

    • ./data/raw/ecoliAnnotation.tsv
    • ./data/raw/MotifsSeqRelation.tsv
    • ./data/raw/geneAASeq.tsv
    • ./data/raw/TFs_coli.txt
    • ./data/raw/Orthologous_ECaaq_RZaadb_blastN_b1_m8.tab
  • Output: ./data/processed/01_mergeDatabase.tsv
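As a rough illustration of the merge step, the Python sketch below left-joins small in-memory excerpts of the raw tables using only the standard library. The join keys are assumptions, since this README does not name them: ProteinID is assumed to link ecoliAnnotation.tsv to geneAASeq.tsv, and Locus_tag to link the result to MotifsSeqRelation.tsv. The annotation excerpt is trimmed to three columns, and the helper names are hypothetical; the actual 01_mergeData.R script may work differently.

```python
import csv
import io

# Excerpts built from the sample rows shown in this README.
annotation_tsv = "ProteinID\tLocus_tag\tNCBI_name\nNP_414542.1\tb0001\tthrL\n"
sequences_tsv = "ProteinID\tSequence\nNP_414542.1\tMKRISTTITTTITITTGNGAG\n"
motifs_tsv = ("TF_name\tLocus_tag\tMotif_description\tmSS\tmSE\n"
              "aaeR\tb3243\tDNA-Binding-Region\t19\t38\n")


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def left_join(left, right, key):
    """Left join on `key`; unmatched left rows are kept without right-hand columns."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], [{}]):
            merged.append({**row, **match})
    return merged


merged = left_join(read_tsv(annotation_tsv), read_tsv(sequences_tsv), "ProteinID")
merged = left_join(merged, read_tsv(motifs_tsv), "Locus_tag")
print(merged)
```

In this excerpt thrL (b0001) gains its sequence but no motif columns, because the only motif row shown belongs to aaeR (b3243).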

02_motifPresenceRelationship.R

  • Objective: Extract the motif and BLAST sequences and check whether they align.

  • Input: ./data/processed/01_mergeDatabase.tsv

  • Output: ./data/processed/02_motifPresenceRelationship.tsv
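One possible way to decide whether a motif is present in a BBH alignment is to compare the motif interval (mSS, mSE) with the aligned query region (qSS, qSE). The Python sketch below shows two such checks. The criterion itself is an assumption, since this README does not say how 02_motifPresenceRelationship.R defines presence, and the function names are hypothetical.

```python
def motif_in_alignment(m_start: int, m_end: int, q_start: int, q_end: int) -> bool:
    """True if the motif lies entirely inside the aligned query region."""
    return q_start <= m_start and m_end <= q_end


def overlap_length(m_start: int, m_end: int, q_start: int, q_end: int) -> int:
    """Number of motif residues covered by the alignment (inclusive coordinates)."""
    return max(0, min(m_end, q_end) - max(m_start, q_start) + 1)


# Coordinates borrowed from the sample rows of this README. They come from
# different proteins, so the combination is purely illustrative.
motif = (19, 38)       # mSS, mSE
aligned = (27, 296)    # qSS, qSE

print(motif_in_alignment(*motif, *aligned))
print(overlap_length(*motif, *aligned))
```

Reporting the overlap length alongside the yes/no answer makes it possible to apply a partial-overlap threshold later without recomputing the intervals.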

03_pipeplot.R

  • Objective: Plot the results of the motif presence/absence analysis.

  • Input: ./data/processed/02_motifPresenceRelationship.tsv

  • Output: ./data/processed/pieplotMotifs.png

References

Altschul, S F, W Gish, W Miller, E W Myers, and D J Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3): 403–10. https://doi.org/10.1016/S0022-2836(05)80360-2.

Escorcia-Rodríguez, Juan M, Andreas Tauch, and Julio A Freyre-González. 2020. “Abasy Atlas V2.2: The Most Comprehensive and up-to-Date Inventory of Meta-Curated, Historical, Bacterial Regulatory Networks, Their Completeness and System-Level Characterization.” Computational and Structural Biotechnology Journal 18 (May): 1228–37. https://doi.org/10.1016/j.csbj.2020.05.015.

Karp, Peter D, Wai Kit Ong, Suzanne Paley, Richard Billington, Ron Caspi, Carol Fulcher, Anamika Kothari, et al. 2018. “The EcoCyc Database.” EcoSal Plus 8 (1). https://doi.org/10.1128/ecosalplus.ESP-0006-2018.

Kuznetsov, Anatoliy, and Colleen J Bollin. 2021. “NCBI Genome Workbench: Desktop Software for Comparative Genomics, Visualization, and Genbank Data Submission.” Methods in Molecular Biology 2231: 261–95. https://doi.org/10.1007/978-1-0716-1036-7_16.

Santos-Zavaleta, Alberto, Heladia Salgado, Socorro Gama-Castro, Mishael Sánchez-Pérez, Laura Gómez-Romero, Daniela Ledezma-Tejeida, Jair Santiago García-Sotelo, et al. 2019. “RegulonDB v 10.5: Tackling Challenges to Unify Classic and High Throughput Knowledge of Gene Regulation in E. coli K-12.” Nucleic Acids Research 47 (D1): D212–20. https://doi.org/10.1093/nar/gky1077.