This program aims to characterize viral communities present in a metagenomics sample using an adapted kmer-signature-based and similairty approach.
Given a metagenomics sample with short-read sequences in .fastq
format, VSigMatch characterizes viral communities present in the sample using an adapted kmer-signature-based and similairty approach. First all the sequences in the sample are chopped up into substrings length of k (kmers). The set of sample kmers would then be compared against all the virus signatures in the processed database to get a vector of overall similarities. Based on kmer abundance in the sample and their presence in each virus signatures along with the similarity vector, VsigMatch score: weighted k-mer-abundance similarity is computed for every signature and used to prioritize the virus with high chances to be present in the sample.
$ pip install -r requirements.txt
Identification of viral composition of a metagenomics sample given a viral
sequence database using an adapted kmer-signature-based and similairty approach.
optional arguments:
-h, --help show this help message and exit
--METAfastq METAfastq
fastq file containing metagenomics sequences to
chracterize
--kmer k length of substring
--topN N number of top virus names prioritized by the method
-output_dir output_dir
name of directory to save output files
To run this program in the command line interface type:
$ python3 run_VSigMatch.py --METAfastq 'inputs/sampled25_SRR23022001.fastq' --kmer 5 --topN 20 -output_dir 'outputs'
or use below command to use all default parameters
$ python3 run_VSigMatch.py -output_dir 'outputs'
$ python3 tests/tests.py
- A .fastq file containing short-reads produced by metagenomics sequencing
Example: sampled25_SRR23022001.fastq
:
@SRR23022001.4.1 4 length=151
AAAGGTTTATACCTTCCCAGGTAACAAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
+SRR23022001.4.1 4 length=151
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
@SRR23022001.10.1 10 length=151
AACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCT
+SRR23022001.10.1 10 length=151
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
...
- Output 1: A Bar plot of the top 20 viral taxonomy names with highest VSigMatch scores
Example: example_viral_char_barplot_k7.png
- Output 2: .txt file containing annotation of the top viruses: GenBank accession, Description, Taxonomy name
Example: example_viral_annotation_k7.txt
genbank_id description taxonomy_name
147 DQ136146.1 Tomato chlorosis virus isolate AT80/99 segment RNA2, complete sequence. Tomato chlorosis virus
213 KX926428.1 Watermelon mosaic virus isolate WMV-Pg, complete genome. Watermelon mosaic virus
301 KU754522.2 Lye Green virus isolate Dobs_PoolSeq4 VP1, VP2, VP3, putative glycoprotein, and putative polymerase genes, complete cds. Lye Green virus
...
Contact Info: Kewalin Samart; kewalin.samart@cuanschutz.edu