/Extract_Genes

A workflow to extract spike gene sequences from SARS-CoV-2 full genome sequences.


Extract_spike_gene

This workflow extracts the spike (surface glycoprotein) gene region from SARS-CoV-2 whole genome sequences. The same principle can be applied to extract other gene sequences from longer genomes.

Tools

  1. BLAST command-line tool. Download from here
  2. Python 3.9.5

Test Data

The test dataset contains a set of SARS-CoV-2 complete genome sequences downloaded from the GISAID database. This multi-fasta file is referred to as seqfile in the subsequent steps. The query file contains a single fasta sequence, here the surface glycoprotein gene sequence from the SARS-CoV-2 Wuhan strain.
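To make the input format concrete, here is a minimal sketch of reading a multi-fasta file into a dictionary of sequences. The function name `read_fasta` and the sample records are hypothetical and not part of this repository; the sketch assumes the record ID is the first whitespace-delimited token of each header line.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}.

    Hypothetical helper: the record ID is taken as the first
    whitespace-delimited token after '>'.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            # Sequence lines may be wrapped; join them and normalise case.
            chunks.append(line.upper())
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Made-up records standing in for seqfile (not real GISAID data).
example = """>genome_A some description
ACGTAC
GTTT
>genome_B
ttgaca
""".splitlines()

seqs = read_fasta(example)
```

For a file on disk, the same function can be called as `read_fasta(open(path))`, since a file object yields its lines.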

Workflow

  1. Create a BLAST database from the whole genome sequences.
  2. Align the spike query sequence to the BLAST database. Use the alignment coordinates to extract the target gene sequence from each genome in the database.
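The two workflow steps can be sketched in Python as follows. This is an illustration under stated assumptions, not necessarily how the repository's code does it: the helper names (`makeblastdb_cmd`, `blastn_cmd`, `extract_hits`) are hypothetical, the alignment is assumed to be written in BLAST tabular format (`-outfmt 6`, whose default columns end with sstart, send, evalue, bitscore), and the subject ID in the BLAST output is assumed to match the fasta record ID. Only the best-scoring hit per genome is kept.

```python
from typing import Dict, Iterable, List, Tuple


def makeblastdb_cmd(seqfile: str, db_name: str) -> List[str]:
    """Step 1: command that builds a nucleotide BLAST database."""
    return ["makeblastdb", "-in", seqfile, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query: str, db_name: str, out_path: str) -> List[str]:
    """Step 2: command that aligns the query and writes tabular output."""
    return ["blastn", "-query", query, "-db", db_name,
            "-outfmt", "6", "-out", out_path]


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def extract_hits(blast_rows: Iterable[str],
                 genomes: Dict[str, str]) -> Dict[str, str]:
    """Slice the best hit (highest bitscore) out of each genome."""
    best: Dict[str, Tuple[int, int, float]] = {}
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a default 12-column outfmt 6 row
        sseqid = fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        if sseqid not in best or bitscore > best[sseqid][2]:
            best[sseqid] = (sstart, send, bitscore)

    extracted: Dict[str, str] = {}
    for sseqid, (sstart, send, _) in best.items():
        genome = genomes.get(sseqid)
        if genome is None:
            continue
        # BLAST coordinates are 1-based and inclusive.
        lo, hi = min(sstart, send), max(sstart, send)
        segment = genome[lo - 1:hi]
        # sstart > send means the hit lies on the minus strand.
        if sstart > send:
            segment = reverse_complement(segment)
        extracted[sseqid] = segment
    return extracted


# Made-up genomes and BLAST rows, for illustration only.
genomes = {"genome_A": "AAAACCCGGGTTTT", "genome_B": "TTTTACGGAATT"}
rows = [
    "spike\tgenome_A\t100.000\t3\t0\t0\t1\t3\t1\t3\t0.5\t5.0",
    "spike\tgenome_A\t100.000\t6\t0\t0\t1\t6\t5\t10\t1e-5\t12.0",
    "spike\tgenome_B\t100.000\t5\t0\t0\t1\t5\t9\t5\t1e-4\t10.0",
]
spikes = extract_hits(rows, genomes)
```

With BLAST installed, the two commands could be run with `subprocess.run(makeblastdb_cmd("seqfile.fasta", "genomes_db"), check=True)` followed by `subprocess.run(blastn_cmd("query.fasta", "genomes_db", "hits.tsv"), check=True)`; the file names here are placeholders. Keeping only the top hit per genome is a simplification: if the spike gene aligns as several separate segments, the hits would need to be merged before slicing.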

Steps

Launch the workflow console within LatchBio using this link. Create an account when prompted.

  1. Load a single spike gene sequence as query_sequence
  2. Load the multi-fasta file containing the full genome sequences as seq_data
  3. Execute the pipeline

Link to the code here

Maintenance