andreas-thesis-gans: A Jupyter Notebook repository from sfragkoul

GENERATIVE-METHODS-IN-GENOMICS

Abstract

This project tackles the challenge of data scarcity in bioinformatics by exploring the use of generative methods for synthetic data creation in genomics. We focus on generating in silico FASTQ files with specific properties, employing a Generative Adversarial Network (GAN) model trained on a custom dataset of quality scores extracted from real FASTQ files.

Despite encountering common GAN training challenges, the model captures some distinctive features and patterns of the training data. This shows the potential of GANs for this task, particularly when suitable techniques are applied or when more advanced methods, such as attention-based models, are integrated.

This dissertation aims to serve as a stepping stone for further exploration and wider adoption of deep generative models in bioinformatics, in order to address the ongoing issue of data limitations.

Data

All the data used in this project are stored and openly accessible in a Google Drive Folder, along with the resulting files.

Framework

The overall framework of the complete algorithm is illustrated below:

Scripts Overview

Following is an overview of the scripts. Note that while access has been provided for the directories used, the paths need to be adjusted for other users, as the files will be located in the "Shared with me" folder in Google Drive.

1.FASTQ_Data_Miner

The Data Miner module gathers FASTQ files from a specified source. It accepts an XML file from the European Nucleotide Archive (ENA) as input and extracts FASTQ files from one or multiple experiments, which allows for a diverse dataset. In this particular use case, however, a single project (PRJEB44548) has been chosen to ensure homogeneity of the data.
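As a rough illustration of the XML parsing step, the sketch below collects FASTQ report links from an ENA-style XML document. It is not the notebook's actual code: the function name `extract_fastq_report_links` is hypothetical, and the tag layout (`XREF_LINK` elements with `DB` and `ID` children, and the `ENA-FASTQ-FILES` database label) is an assumption about the export format that should be checked against the real file.

```python
import xml.etree.ElementTree as ET


def extract_fastq_report_links(xml_text, db_name="ENA-FASTQ-FILES"):
    """Return the ID text of every XREF_LINK whose DB child equals db_name.

    Assumption: the XML stores cross-references as
    <XREF_LINK><DB>...</DB><ID>...</ID></XREF_LINK>.
    """
    root = ET.fromstring(xml_text)
    links = []
    for xref in root.iter("XREF_LINK"):
        db = xref.findtext("DB", default="").strip()
        link = xref.findtext("ID", default="").strip()
        if db == db_name and link:
            links.append(link)
    return links
```

Each returned link would then be followed to download the corresponding FASTQ files; that download step is omitted here because it depends on the format of the linked report.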

2.FASTQ_Dataset_Organizer

The Dataset Organizer has been developed as a toolkit, mainly with future extensions of the script in mind. In the current project, only the functionality that creates a single group is used. The other methods in this script can categorize and structure the collected FASTQ files based on metadata, and handle unevenly distributed data such as sequences of varying lengths, offering scalability and adaptability.

3.FASTQ_DataLoader

Responsible for preprocessing, the Data Loader formats and prepares the dataset to suit the input requirements of the GAN model, ensuring seamless integration into the training pipeline.
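A minimal sketch of this kind of preprocessing is shown below. It assumes PHRED+33 encoded quality strings, a fixed read length with zero-padding for shorter reads, and scaling to [-1, 1]; the cap of 41 and the scaling range are illustrative assumptions, and the function names are hypothetical rather than taken from the notebook.

```python
PHRED_OFFSET = 33  # ASCII offset of the Sanger / Illumina 1.8+ encoding
MAX_PHRED = 41     # assumed upper bound used for scaling


def read_quality_lines(fastq_lines):
    """Return the quality string (fourth line) of each 4-line FASTQ record."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    return [lines[i] for i in range(3, len(lines), 4)]


def encode_qualities(quality_strings, read_length):
    """Convert quality strings to rows of floats scaled to [-1, 1].

    Reads longer than read_length are truncated; shorter reads are
    padded with score 0, which maps to -1.0.
    """
    encoded = []
    for quality in quality_strings:
        scores = [min(ord(c) - PHRED_OFFSET, MAX_PHRED)
                  for c in quality[:read_length]]
        scores += [0] * (read_length - len(scores))
        encoded.append([s / MAX_PHRED * 2.0 - 1.0 for s in scores])
    return encoded
```

The resulting rows all have the same length, so they can be stacked into a single array or tensor before being passed to the model.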

4.PHRED_GAN

The GAN architecture is specifically designed to generate synthetic quality scores, mirroring the characteristics of PHRED scores of authentic FASTQ files.

5.VARIANT_READS_GENERATOR

The Variant Reads Generator receives a FASTA file as input, which serves as the reference genome sequence for generating the synthetic reads, and generates reads of specific length. In addition to the FASTA file, the script also takes a CSV file containing variant information, such as SNPs, and applies them to the respective reads according to each position and Variant Allele Frequency (VAF).
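The idea can be sketched as follows, under several simplifying assumptions: reads are sampled uniformly from a single reference sequence, only SNPs are handled, positions are 0-based, and the CSV columns are named `position`, `alt` and `vaf`. These names, like the function names, are hypothetical and may differ from the actual script.

```python
import csv
import io
import random


def load_variants(csv_text):
    """Parse variants from CSV text with assumed columns position, alt, vaf."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(row["position"]), row["alt"], float(row["vaf"]))
            for row in reader]


def generate_variant_reads(reference, variants, read_length, n_reads, seed=0):
    """Sample reads from the reference and apply SNPs according to their VAF.

    Each read that overlaps a variant position carries the alternative
    allele with probability equal to that variant's VAF.
    Returns a list of (start, read) tuples.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, len(reference) - read_length)
        read = list(reference[start:start + read_length])
        for position, alt, vaf in variants:
            offset = position - start
            if 0 <= offset < read_length and rng.random() < vaf:
                read[offset] = alt
        reads.append((start, "".join(read)))
    return reads
```

With enough reads covering a position, the fraction of reads carrying the alternative allele approaches the requested VAF.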

6.FASTQ_GENERATOR

This script serves as the final step in the synthetic data generation pipeline. It integrates the variant reads produced by the VARIANT_READS_GENERATOR with the synthetic quality scores generated by the PHRED_GAN.
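A simplified sketch of this merging step is given below. It assumes the generator outputs one row of values in [-1, 1] per read (for example from a tanh output layer) and that these map linearly to PHRED scores 0 to 41 in PHRED+33 encoding. Both the scaling and the names `decode_qualities` and `build_fastq` are assumptions made for illustration.

```python
def decode_qualities(scaled_scores, max_phred=41, offset=33):
    """Map values in [-1, 1] back to a PHRED+33 quality string."""
    chars = []
    for value in scaled_scores:
        score = round((value + 1.0) / 2.0 * max_phred)
        score = max(0, min(max_phred, score))  # clamp out-of-range outputs
        chars.append(chr(score + offset))
    return "".join(chars)


def build_fastq(reads, quality_rows, prefix="synthetic_read"):
    """Pair each read with one row of scores and return FASTQ-formatted text."""
    if len(reads) != len(quality_rows):
        raise ValueError("need exactly one quality row per read")
    records = []
    for index, (read, row) in enumerate(zip(reads, quality_rows), start=1):
        quality = decode_qualities(row[:len(read)])
        if len(quality) != len(read):
            raise ValueError("quality row is shorter than its read")
        records.append(f"@{prefix}_{index}\n{read}\n+\n{quality}\n")
    return "".join(records)
```

Each record follows the standard four-line FASTQ layout: identifier, sequence, separator, and a quality string of the same length as the sequence.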

Results

Information about the resulting FASTQ file, in comparison with real ones, is demonstrated in the plots below:

Generated FASTQ file:

Real FASTQ files:

Observations:

The generated file shows both similarities to real data and areas for improvement. For instance, while it mimics the general quality score distribution and GC content of real data, it exhibits distinct peaks in the middle and a higher frequency of lower quality scores. This suggests the model captures the overall trend but struggles with finer details of quality score distribution and nucleotide composition.
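The two summary statistics mentioned here, GC content and quality score distribution, can be computed with a few lines of Python. The sketch below is only an illustration of those metrics, assuming PHRED+33 encoding; it is not the tool used to produce the plots, and the function names are hypothetical.

```python
from collections import Counter


def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence.upper() if base in "GC")
    return gc_count / len(sequence)


def quality_histogram(quality_strings, offset=33):
    """Return the relative frequency of each PHRED score across all reads."""
    counts = Counter()
    for quality in quality_strings:
        counts.update(ord(char) - offset for char in quality)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {score: n / total for score, n in sorted(counts.items())}
```

Computing these values for both the generated and the real files gives a numeric counterpart to the visual comparison.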

Discussion

This study demonstrates the potential of GANs for generating synthetic FASTQ files, but further exploration is needed. Hyperparameter tuning, alternative GAN architectures, and attention-based models could lead to improved performance. Expanding the scope to include diverse genomic variations and handling various sequence lengths would enhance the framework's applicability. Additionally, incorporating error profiles would further increase the realism of the generated FASTQ files.
