This workflow extracts the spike (glycoprotein) gene region from SARS-CoV-2 whole genome sequences. The same principle can be applied to extract gene sequences from longer genomes.
- blast command-line tool. Download from here
- Python 3.9.5
The test dataset contains a set of SARS-CoV-2 complete genome sequences downloaded from the GISAID database. This multi-fasta file is referred to as seqfile in the subsequent steps. The query file contains a single fasta sequence, here the surface glycoprotein gene sequence from the SARS-CoV-2 Wuhan strain.
- Create a database of the whole genome sequences.
- Align spike query sequence to blast database. Use mapping coordinates to extract the target gene sequences from database.
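The two steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code. It assumes `makeblastdb` and `blastn` are on the PATH, that BLAST reports subject IDs matching the first whitespace-delimited token of each fasta header, and that one best-scoring hit per genome is enough. The function names and the database name `genomes_db` are hypothetical.

```python
import subprocess
from typing import Dict, List, Tuple

# Translation table used to reverse-complement minus-strand hits.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(path: str) -> Dict[str, str]:
    """Read a multi-fasta file into {record_id: sequence}."""
    records: Dict[str, List[str]] = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Assumption: the record ID is the first token of the header.
                current = line[1:].split()[0]
                records[current] = []
            elif line and current is not None:
                records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def run_blast(seqfile: str, query: str, db_name: str = "genomes_db") -> str:
    """Build a nucleotide database from seqfile and align the query to it."""
    subprocess.run(
        ["makeblastdb", "-in", seqfile, "-dbtype", "nucl", "-out", db_name],
        check=True,
    )
    result = subprocess.run(
        ["blastn", "-query", query, "-db", db_name,
         # Tabular output restricted to the columns needed for extraction.
         "-outfmt", "6 sseqid sstart send bitscore"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def extract_hits(genomes: Dict[str, str], tabular: str) -> Dict[str, str]:
    """Cut the best-scoring aligned region out of each genome."""
    best: Dict[str, Tuple[float, int, int]] = {}
    for line in tabular.splitlines():
        if not line.strip():
            continue
        sseqid, sstart, send, bitscore = line.split("\t")
        hit = (float(bitscore), int(sstart), int(send))
        if sseqid not in best or hit[0] > best[sseqid][0]:
            best[sseqid] = hit

    extracted: Dict[str, str] = {}
    for sseqid, (_, sstart, send) in best.items():
        genome = genomes[sseqid]
        if sstart <= send:
            # Plus-strand hit: BLAST coordinates are 1-based and inclusive.
            extracted[sseqid] = genome[sstart - 1:send]
        else:
            # Minus-strand hit: sstart > send, so reverse-complement the slice.
            extracted[sseqid] = genome[send - 1:sstart].translate(COMPLEMENT)[::-1]
    return extracted
```

A typical call would be `extract_hits(read_fasta(seqfile), run_blast(seqfile, query))`. Note that only the aligned region is returned, so a divergent genome may yield a slightly truncated gene, and `blastn` limits the number of reported subjects by default, which may need raising for large datasets.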
Launch the workflow console within LatchBio using this link. Create an account when prompted.
- Load a single spike gene sequence as query_sequence
- Load the multi-fasta file containing full genome sequences as seq_data
- Execute the pipeline
Link to the code here