/Evans-Yamamoto_et_al_2023

The computer codes and control files, defining the settings for the codeml program execution, used in the publication Parallel nonfunctionalization of CK1δ/ε kinase ohnologs following a whole-genome duplication event.

Primary LanguageJupyter NotebookOtherNOASSERTION

Evans-Yamamoto_et_al_2023

This repository contains the computer codes and control files, defining the settings for the codeml program execution, used in Evans-Yamamoto et al (2023) Parallel nonfunctionalization of CK1δ/ε kinase ohnologs following a whole-genome duplication event.

Installation

Please make sure you have appropriate Python, pip, and R before starting.

Python version >= 3.5
pip    version >= 1.1.0
R      version >= 4.2.2

Download scripts by first clone this repository by execuiting the following command in the terminal.

git clone https://github.com/LandryLab/EVANS-Yamamoto_et_al_2023.git

Dependencies

  • Python

    numpy  version >=1.19
    pandas version >=1.3.4
    

    In the terminal, go to the location of the downloaded folder, and install the dependencies above by executing the following command.

    pip install .
  • R

    sessioninfo   version >=1.2.2	
    ggplot2       version >=3.4.2
    reshape2      version >=1.4.4
    GGally        version >=2.1.2	
    ggridges      version >=0.5.4
    plyr          version >=1.8.8	
    dplyr         version >=1.1.2
    tidyr         version >=1.3.0
    tidyverse     version >=2.0.0	
    Cairo         version >=1.6.0	
    matrixStats   version >=1.0.0
    forcats       version >=1.0.0
    hardhat       version >=1.3.0	
    gridExtra     version >=2.3	
    ggExtra       version >=0.10.0
    egg           version >=0.4.5
    devtools      version >=2.4.5	
    ggtree        version >=3.6.2	
    castor        version >=1.7.10	
    treeio        version >=1.22.0	
    TreeTools     version >=1.9.2
    stringr       version >=1.5.0
    cowplot       version >=1.1.1	
    ggpubr        version >=0.6.0	
    gggenes       version >=0.5.0	

    To install these packages, execute the following script in the terminal.

    Rscript install_dependencies.r

Other programs

List of content and description

This repository contains the following folders. The folders are numbered in sequencial order for execution.

00_Preliminary_analysis

Scripts regarding the preliminary analysis on YGOB data base, idintifying essential genes in S.cerevisiae, which are maintained as duplicates in other species.

  • YGOB_wgd_essentiality_stats.csv
    Input data, created from Gene order & annotation from the YGOB database and gene essentiallity information from the SGD database.
  • YGOB_stats_R.ipynb/.html
    Script to filter and count species maintaining ohnologs for each gene.

  • YGOB_ScerEssential_ScerCount1_ZscorePostWGDOver2.csv
    Stats outputed from YGOB_stats_R.

01_Ortholog_sequence_retrieval

Scripts regarding Ortholog sequence retrieval section in the manuscript. It contains the following folders and files;

01_RefSeq_Protein_retrival

  • Saccharomycetaceae_species.csv
    List of Saccharomycetaceae species in the NCBI database.

  • download_ncbi_genomes.sh
    Script to download files from NCBI.

  • NCBI_download_wrapper.ipynb
    Python notebook to download the genomes and protein files for the species listed in Saccharomycetaceae_species.csv.

  • 2023-03-01_NCBI_download_summary.csv
    Intermediate output from NCBI_download_wrapper.ipynb.

  • hrr25.faa
    Fasta file containing the S. cerevisiae Hrr25p sequence.

  • blast.ipynb

    Python notebook to create BLASTp databases from the downloaded protein files (under ./blastp/db), and perform BLASTp using the S. cerevisiae Hrr25p (output under ./blastp/out).

  • blastp
    Folder containing intermediate files for protein blast.

  • 2023-03-02_BLASTp_parsed.csv
    Parsed data from the protein blast.

  • Saccharomycetaceae_BLASTp_hits.fasta
    Protein fasta file containing the 206 identified orthologs in the first alignment.

  • 2023-03-02_NCBI_BLASTp_SGD_hits_parsed.xlsx Excel file containing BLASTp results of Saccharomycetaceae_BLASTp_hits.fasta against the SGD database (S. cerevisiae proteins).

  • Saccharomycetaceae_Hrr25_summary.csv csv file with summary of extracted Hrr25p sequences, removing all false positives. It also includes annotated orthologs in the YGOB database. the

02_Phylogenetic_tree

  • 1672taxa_290genes_bb_1.treefile
    Phylogenetic tree file from Li et al. (2021) Current Biology

  • tree_Li_etal_2021.ipynb
    R script in jupyter notebook to load and trim the phylogenetic tree, based on a set of species presnet in ../01_RefSeq_Protein_retrival/Saccharomycetaceae_Hrr25_summary.

  • selected_species_tree.txt
    Trimmed tree output from the script.

  • sel_species.csv
    Output from the script, with list of spcecies present in the trimmed tree.

  • SelectedSpeciesTree_plot.pdf
    Visualized tree output from tree_Li_etal_2021.ipynb.

03_Extended_homolog_search

  • search_homolog.ipynb
    Script to perform BLAST alignments agaisnt all genoe sequences using ./blast/db/HRR25_nuc_nonAligned.fasta as query.

  • blast
    Folder containing database and outputs from BLAST alignments.

  • genomes.zip
    Compressed folder with genoome sequences which orthologs are going to be retrieved from. Since this file is too large to upload to github, it is available here.

  • blast_hits.csv
    A file containing all BLAST hits, present in ./blast/out.

  • unique_regions_to_extract.csv
    Unique gene regions parsed from blast_hits.csv

  • HRR25_homologs_nt_extracted.fna
    Fasta file containing homologs identified from genomic seuquences.

  • HRR25_merged.fna
    The result from this folder (HRR25_homologs_nt_extracted.fna) was merged with the input for homology search (./blast/db/HRR25_nuc_nonAligned.fasta) to be used for downstream analysis.

04_Cleanup_homolog

  • alignment4ORFdetection.ipynb
    Python script to perform MAFFT-linsi and identify ORF regions for HRR25_merged.fna.

  • HRR25_mafft_linsi.txt
    Output from MAFFT-linsi.

  • HRR25_homologs_aa_trimmed.fna
    Output from alignment4ORFdetection.ipynb, contiaining protein sequences in fasta format.

  • HRR25_trimed_aa_info.csv
    Output from alignment4ORFdetection.ipynb, contiaining protein sequences in csv format.

  • HRR25_homologs_nt_trimmed.fna
    Output from alignment4ORFdetection.ipynb, contiaining nucleotide sequences in fasta format.

  • HRR25_trimed_nt_info.csv
    Output from alignment4ORFdetection.ipynb, contiaining nucleotide sequences in csv format.

  • TableS1_ListofGenes.xlsx
    The output from 04_Cleanup_homolog was used to create a list of orthologs presented in Supplementary Table 1 (TableS1_ListofGenes.xlsx) of the manuscript. I assigned each ortholog a unique ID (present in the column GeneID_codeml), since codeml requires identifiers which are short. Using this file, I created inputs for downstrream analysis which are present in the folder 05_gene_tree_construction.

05_gene_tree_construction

  • HRR25_geneanalysis_aa.fna and HRR25_geneanalysis_nt.fna
    Fasta files containing the ortholog sequences identified by unique IDs, created from TableS1_ListofGenes.xlsx.

  • trim_protein.ipynb
    Python notebook to create inputs for TranslatorX, a program to perform alignment based on codons.

  • HRR25_geneanalysis_aa_trimmed.fna
    Output from trim_protein.ipynb, where protein sequence is properly annotated (excluding regions after stop codons etc).

  • HRR25_geneanalysis_nt_translatorXinput.fna
    Output from trim_protein.ipynb, with nucleotide sequences corresponding to HRR25_geneanalysis_aa_trimmed.fna. I use file this for input in TranslatorX.

  • translatorX_perl
    A folder containing scripts from TranslatorX

  • translatorX_res
    A folder containing results from TranslatorX, using HRR25_geneanalysis_nt_translatorXinput.fna as input.

  • raxml_res
    A folder containing scripts and results from raxml-ng. I created the input file which only contains orthologs from post-WGD species which maintained two orthologs (HRR25_mafft_translatorx.nt_ali_PostWGD_selected.fasta) from the output of TranslatorX (HRR25_mafft_translatorx.nt_ali.fasta). The resulting tree was used manually create the recomciliated tree y replacing the post-WGD species with maintained duplicates with the tree presented in HRR25_mafft_translatorx.nt_ali_PostWGD_selected.fasta.raxml.bestTree. The resulting tree can be found in HRR25_genetree_postWGDGeneTreeIntegrated_ID_M0.txt.

02_Multiple_Sequence_Alignment_analysis

Scripts to reproduce Figure 1C of the paper.

  • input
    Input for this analysis is the codon based alignment of orthologs, identical to ../05_gene_tree_construction/translatorX_res/HRR25_mafft_translatorx.aa_ali.fasta.

  • meta_data
    Folder with meta data, includig domain annotations and position information to aid interpretation of the plots.

  • output
    Folder with outputs, including intermediate files with similarity scores by position.

  • msa_analysis.ipynb
    A R script in jupyter-notebook, which was used to calculate the similarity score for each residue in orthologs.

  • plot_similarity.ipynb
    A R script in jupyter-notebook, which was used visualize the data as presented in Figure 1C.

03_dNdS_analysis

Scripts to reproduce Figure 1D-G of the paper.

00_data_preparation
Data presented in the raw_file folder is proccessed using the script alignment2nogap.ipynb in order to create fasta files for codeml analysis. Some manual modifications (inserting the header for file format etc) was performed to ensure proper execution of codeml.

01_codeml
In this folder, the inputs, control files (*.ctl), log files, and outputs from codeml are shown.

02_evolution_rate_analysis
In this folder, intermediate files for generating figures based on codeml output is presented, as well as scripts and visualized output.

  • domain_dNdS_heatmap.ipynb
    Script to vizualize the domain based dN/dS values as heatmap.
  • evolutionary_rate_analysis_R.ipynb
    Script to analyze branch lengths and assymtry from codeml output (Figure 1E-G).
  • Results
    Folder containing all plots

04_Combinatorial_functional_complementation_screening

Scripts and output related to combinatorial complementation screening.

  • Input

    • Sample information for analysis
    • Image data from S&P imager (Available upon request to the corresponding author)
    • Numeric values extracted from the Image data (available here)
  • Scripts

    • 01_QuantifyAreaFromPlatePicture.ipynb
      Script to extract colony area from each image.

    • 02_AUC_computation.ipynb
      Script to compute Area Under the Curve from colony area information.

    • 03_parse_auc_data_2_scores_20230828.ipynb
      Script to compute complementation scores, using AUC values in selectio nand non-selection conditions.

    • 04_plot_heatmap.ipynb
      Script to plot heatmap from the complementation scores.

  • Output
    Files generated from the scripts. Plots were used to prepare Figure 2C and Figure 2D of the paper.

05_DHFR-PCA_assay

Scripts and output related to the DHFR-PCA screening.

  • Input

    • Sample information for analysis
    • Image data from S&P imager (Available upon request to the corresponding author)
    • Numeric values extracted from the Image data (2022-12-09_MTX_Sel2_AUC_data_Cterm.csv)
  • Scripts

    • 01_robotpics_analysis.ipynb
      Script to extract colony area from each image.

    • 02_AUC_computation.ipynb
      Script to compute PPI scores, using AUC values.

    • 03_parse_screening_data.ipynb
      Script to parse screening information and PPI data.

    • 04_Analysis.ipynb
      Script to analyze PPI data and output stats.

  • Output
    Plots and intermediate files generated from the scripts. Plots were used to make Figure 3C and 3D of the paper.

06_GO_enrichment_analysis_of_PPI_partners

  • Input
    PPI score data (HRR25_orthologs_PPI_screening_parsed_2023-02-17DEY.csv) from 05_DHFR-PCA_assay.

  • Scripts

    • 01_data_proccessing.ipynb
      Script to proccess PPI data and meta data for GO enrichment analysis.
    • 02_GO_Analysis.ipynb
      Script to perform GO enrichment analysis on PPI partners.
  • Output
    Plots and files generated from the scripts. The folder GO_results contians csv files for GO enrichment analysis results for each ortholog's PPI partner, which is combined to one file as seen in GO_aggregated_results.csv. Figures were used to make Figure 3B of the paper.

07_SH3_domain_motif_analysis

  • Input

    • PPI score data (HRR25_orthologs_PPI_screening_parsed_2023-02-17DEY.csv) from 05_DHFR-PCA_assay.
    • pwm_dir (folder containing SH3 posision weight matrix from this paper
    • Protein fasta files of HRR25 orthologs and the yeast proteome for motif search.
    • ID conversion file for SH3 proteins (yeast_sh3_accession_to_GN.txt).
  • Scripts

    • 01_motif_search.ipynb
      Script to evaluate SH3 binding motifs in HRR25 orthologs.
    • 02_plot_PPIandSH3Motif.ipynb
      Script to visualize the results.
  • Output
    Plots and files generated from the scripts. The folder contians a csv file (SH3_PWM_scan_HRR25Orthologs_MSS.csv) with all values from the PWM matches. Plots are as shown in Figure 3D of the paper.