/pgc_btip_topone

TopONE is a project that studies recombination events using topological data analysis. This was made in the PGC-BTIP 2024 internship by Alfonso Miguel Cruz and Claire Angelie Gabisay, and mentored by C.J. Palpal-latoc.

Primary LanguagePython

TopONE (Topological Omics for Non-tree Evolution)

This project aims to understand how recombination events influence various topological quantities of a genomic dataset using topological data analysis (TDA). The project seeks to determine whether topology can detect recombination events. More specifically, this project asks the following questions :

  1. Is topology robust to noise and sparse genomic samples?
  2. Does recombination change the topology of genomic samples?

The project's repository makes use of scripts and sequence files to answer the questions above. This project was made during the 2024 Bioinformatics Training and Internship Program headed by the Philippine Genome Center.

Installation

Population Genetics Software ms

ms is a population genetics software that generates samples under different models. The source code and software manual containing the installation and compilation instructions are found here.

Sequence Alignment Software nextclade

Nextclade is an online web application bioinformatics tool to align sequences and conduct phylogenetic analyses. It also has a CLI version found here, which performs the same functionalities as the web application version. Download the the latest version of the software based on your architecture (i.e. x86-64) and operating system.

Required Libraries

The list of libraries used by this project are listed in the requirements.yml file. Before installing the libraries, create a conda environment. To install the libraries, use the command:

conda env create --name test --file=requirements.yml

Pipelines

Goal 01: Is topology robust to noise and sparse genomic samples?

The pipeline begins with generating the simulated sequences from ms with defined parameters: number of samples, number of simulations, number of recombination sites, number of segregation sites or sample length, and the mutation and recombination rates. The output is a group of sequences in each simulation stored as a text file, where all sequences are in a biallelic format, which contains 0 or 1 as its only characters in each sequence. These sequences are then extracted from the text files. A sample of the generated sequences are located in the simseq folder.

Once the sequences have been produced, the Hamming distance matrix is generated by computing the Hamming distances of all sequences against each other, for every simulation.

After creating the Hamming distance matrix, it is passed to Ripser, a persistent homology library, to compute for the birth and death times of each homology group from $H_0$ to $H_2$. This process is repeated for every simulation.

Once the birth and death time pairs have been computed, the Betti numbers from $\beta_0$ to $\beta_2$, barcode length means, and barcode length variance, collectively known as the topological quantities, are computed for each simulation. These values are used for plotting and statistical analysis.

To investigate on how sparsity affects the topology, a fraction of the sequences from the population are taken before computing the Hamming distance matrices. We take a fraction of the population in values from 10% to 100% in increasing 10% increments to see any changes in the values of the topological quantities.

To investigate on how noise affects the topology and its quantities, we add a noise matrix to the computed Hamming distance matrix. This is done by creating a matrix $N$ with the same dimensions as the Hamming distance matrix. Each element $N_{ij}$ is sampled from a normal distribution $\mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the variance in values ranging from 0 to 100 with increasing increments of 5.

Goal 02: Does recombination change the topology of genomic samples?

In this pipeline, virus sequence samples were taken from the GISAID database. The recombinant lineage and its parent lineage samples were downloaded, each stored in a .fasta file. The files were reorganized to produce three new FASTA files:

  • Recombinant samples (recombinant lineage samples)
  • Nonrecombinant samples (both parent lineage samples)
  • Mixed samples (recombinant and parent lineage samples)

After organizing the sequences, all sequences in each file were aligned using the Nextclade CLI against the SARS-CoV-2 Wuhan reference genome. The aligned sequences were processed further by removing a column of nucleotides in all sequences if at least one sequence contains an unknown nucleotide. This processing results in sequences containing the four nucleotides (A, T, C, G) and gaps (-) only.

Once the sequences have been processed, we continue a similar pipeline as shown in Goal 01, by producing the Hamming distance matrices from the sequences, and computing the birth and death times per homology group from the Hamming distance matrix.

After obtaining the birth and death times, they are used for statistical analysis by using the Kruskal-Wallis Test and Independent Mann-Whitney U Test at a 5% level of significance.

Data

ms-generated Sequences

The simulated samples were generated from ms. These sequences are in a biallelic format, where the characters in a sequence are 1 or 0 only.

GISAID SARS-CoV-2 Sequences

The virus sequence samples, specifically SARS-CoV-2 samples, were taken from the GISAID database. Three recombinant lineages and their parent lineages were taken across different countries. The table below shows the lineages used and the countries whose samples were used:

Recombinant Lineage Parent Lineage 1 Parent Lineage 2 Countries
XBC.1 BA.2 B.1.617.2 China, Philippines, Singapore, South Korea, United States of America
XBE BA.5.2 BE.4.1 United Kingdom, United States of America
XBZ BA.5.2.1 EF.1.3 Austria, Denmark, Germany

Execution

Goal 01: Is topology robust to noise and sparse genomic samples?

Before generating the sequences, ensure that the software ms and the shell script popgensims.sh belong in the same directory, and the input directory inputs/simseq/ and output directories outputs/goal_one_dfs/ and outputs/goal_one_plots/ are also present as shown below:

pgc_btip_topone/
└── inputs/
     ├── ...
     └── simseq/
          ├── ...
          └── sims_n100_t50_r144.txt
└── outputs/
     ├── goal_one_dfs/
     └── goal_one_plots/
├── popgensims.sh
└── ms.exe

To generate the simulated sequences, we use ms on the shell script popgensims.sh to produce multiple simulations with different text files, based on the different configurations of number of samples and mutation and recombination rates. Prior to execution of the script, convert this script into an executable file using the command: chmod +x popgensims.sh. To execute the script, run the command on your terminal:

./popgensims.sh

The output of the simulated sequences are found at the simseq folder. Once the sequences have been generated, the plots can now be created.

The script goal_one_plots.py uses simulated sequence data to produce the plots necessary to answer this question.

To execute the script, simply execute python goal_one_plots.py on your command line with any of the following arguments:

  • --get-one-plot: gets one PDF file containing two plots: the variance and sparsity plots. Using this argument must also require the use of five additional arguments:

    • -nreps: number of simulations in the file
    • -nsite: number of segregating sites (sample length)
    • -nsam: simulated sample size
    • -t: mutation rate $\theta$
    • -r: recombination rate $\rho$
  • --get-all-plots: gets all PDF files containing the variance and sparsity plots, across every combination of possible simulated sample sizes, mutation rates, and recombination rates. Using this argument must also require the use of five additional arguments:

    • -nreps: number of simulations in the file
    • -nsite: number of segregating sites (sample length)
    • -ns: list of simulated sample sizes
    • -ts: list of mutation rates
    • -rs: list of recombination rates
  • --verbose: Optional argument that outputs the logs the start and end of each step in the script.

The outuput plots are stored in the goal_one_plots folder. The variance and sparsity plots are stored in a pdf file format. THe output dataframes are stored in the goal_one_dfs folder.

Some sample commands are shown below:

  • python goal_one_plots.py --get-one-plot -nreps 50 -nsite 300 -nsam 100 -t 50 -r 72 --verbose
  • python goal_one_plots.py --get-all-plots -nreps 50 -nsite 300 -ns 100 -ts 50 500 5000 -rs 4 12 36 72 144

Goal 02: Does recombination change the topology of genomic samples?

The script goal_two_plots.py uses the GISAID sequence data to produce the data required for the plots and for the statistical analysis.

The user can also provide their own input data, whether it is viral sequences (.fasta) or Hamming distance matrices (.csv), but must be stored in a folder inside the inputs directory, as shown below:

pgc_btip_topone/
└── inputs/
     ├── ...
     └── your_input_folder_here/

If viral sequences files will be used as your input, ensure that the sequences satisfy the following criteria:

  • Organized: Suppose cn_r.fasta, cn_p1.fasta, and cn_p2.fasta are the viral sequence data for the recombinant and parent lineages, respectively. The sequences must be used to produce recombinant, nonrecombinant, and mixed sequence .fasta files.

  • Aligned: The recombinant, nonrecombinant, and mixed sequence files must be aligned against the Wuhan reference genome to produce the aligned sequences. The Wuhan reference genome can be downloaded here.

Before organizing and aligning the sequence files, ensure that the project directory looks like this:

pgc_btip_topone/
└── inputs/
     ├── ...
     └── your_input_folder_here/
          ├── cn_r.fasta
          ├── cn_p1.fasta
          └── cn_p2.fasta
     └── wuhan_reference_genome.fasta
├── nextclade_align_sequences.sh
├── organize_sequences.sh
├── sequence_counts.csv
├── topONE.Rmd
└── nextclade-x86_64-unknown-linux-gnu

To organize the sequences, use the shell script organize_sequences.sh to group the sequences into recombinant, nonrecombinant, and mixed sequences stored in .fasta files. To convert this script into an executable, use the command chmod +x organize_sequences.sh. To run the executable on your terminal, run the command:

./organize_sequences.sh

Your input folder should look like this after running the script:

pgc_btip_topone/
└── inputs/
     ├── ...
     └── your_input_folder_here/
          ├── ...
          ├── cn_recom_unaligned.fasta
          ├── cn_nonrecom_unaligned.fasta
          └── cn_mixed_unaligned.fasta
     └── wuhan_reference_genome.fasta
├── ...
└── nextclade-x86_64-unknown-linux-gnu

After organizing the sequences accordingly, align the sequences using Nextclade's CLI through nextclade-x86_64-unknown-linux-gnu to align sequences against the Wuhan reference genome. To get the aligned sequences, use the script nextclade_align_seqeuences.sh to align all sequences across all organized files. To make the script executable, use the command chmod +x nextclade_align_seqeuences.sh and run the script as an executable:

./nextclade_align_sequences.sh

The aligned sequences are stored in a .fasta file within the input file directory shown below:

pgc_btip_topone/
└── inputs/
     ├── ...
     └── your_input_folder_here/
          ├── ...
          ├── cn_recom_aligned.fasta
          ├── cn_nonrecom_aligned.fasta
          └── cn_mixed_aligned.fasta
├── ...
└── nextclade-x86_64-unknown-linux-gnu

To execute the script, execute python goal_two_plots.py on your command line with any of the following arguments:

  • --mats : Provide the recombinant lineage name to retrieve the Hamming distance matrices per country associated with this lineage.
  • --seqs : Provide the recombinant lineage name to retrieve the organized aligned sequences in .fasta files provided by the user themselves, for each country they specify.
  • --fname : Provide the filename of the sequence counts file to be used.
  • --verbose: Optional argument that outputs the logs the start and end of each step in the script.

The sequence counts file is also used by the script to access the number of sequences found in each lineage for each country, which aids in accessing the input files and creating the dataframe containing birth and death times. A sample of this file is shown here.

The output dataframe is stored in the goal_two_dfs folder stored as a .csv file. The information from the dataframe is used by R markdown notebook topONE.Rmd to conduct the statistical analyses to answer the goal.

Some sample commands are shown below:

  • python goal_two_plots.py --mats xbe --fname sequence_counts.csv --verbose
  • python goal_two_plots.py --seqs xbf --fname sequence_counts_xbf.csv

Results

To view our results, you may access our presentation slide deck here.