This pipeline removes primer and adaptor sequences from reads based on blast hits. The example input fasta and quality files come from a dataset generated on the 454 sequencing platform. The script blasts the dataset against the given primer and adaptor sequences and writes the results in m8 format.
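For readers unfamiliar with m8, it is the 12-column, tab-separated BLAST tabular format. The sketch below shows one way to read such a line into a named record. The names `M8Hit`, `parse_m8_line`, and `read_m8` are illustrative and are not taken from SeqProcessor.py. Whether the reads appear as query or subject in this pipeline's output is not stated here, so check `blast_out.m8` before relying on either column.

```python
from collections import namedtuple

# The 12 standard columns of BLAST tabular (m8) output, in order.
M8_FIELDS = [
    "query", "subject", "identity", "aln_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
]
M8Hit = namedtuple("M8Hit", M8_FIELDS)


def parse_m8_line(line):
    """Convert one tab-separated m8 line into an M8Hit record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(M8_FIELDS):
        raise ValueError("expected 12 columns, got %d" % len(cols))
    return M8Hit(
        cols[0], cols[1], float(cols[2]),
        int(cols[3]), int(cols[4]), int(cols[5]),
        int(cols[6]), int(cols[7]), int(cols[8]), int(cols[9]),
        float(cols[10]), float(cols[11]),
    )


def read_m8(path):
    """Read every non-empty, non-comment line of an m8 file."""
    hits = []
    with open(path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                hits.append(parse_m8_line(line))
    return hits


# Invented example line, for illustration only (not real pipeline output).
example = "read_001\tprimer\t100.00\t21\t0\t0\t1\t21\t1\t21\t2e-07\t42.1"
hit = parse_m8_line(example)
```

With a real run, `read_m8("Results/blast_out.m8")` would return one record per hit, assuming the default output folder name.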
- python2.7 or python3.6 (or above)
- To install Biopython through pip: pip install biopython
- To install blast, follow the instructions here
Fasta and quality files containing the reads and corresponding quality scores.
Default Primer Sequence: CGCCGTTTCCCAGTAGGTCTC
Default Adaptor Sequence: ACTGAGTGGGAGGCAAGGCACACAGGGGATAGG
python SeqProcessor.py -r Files_for_test/test.fna -q Files_for_test/test.qual
python SeqProcessor.py
Usage: SeqProcessor.py [options]
Options:
-h, --help show this help message and exit
-r FILE, --read=FILE raw reads fasta file
-q FILE, --qual=FILE raw reads quality file
-l LENGTH_CUTOFF, --len=LENGTH_CUTOFF
length cutoff
-s QUAL_CUTOFF, --score=QUAL_CUTOFF
quality score cutoff
-p PRIMER_SEQ, --primer=PRIMER_SEQ
primer sequence
-a ADAPTOR_SEQ, --adaptor=ADAPTOR_SEQ
adaptor sequence
-w WORD_SIZE, --word_size=WORD_SIZE
blast word size
-e EVALUE, --evalue=EVALUE
blast evalue
-t TOLERATE, --tolerate=TOLERATE
maximum tolerated error rate when trimming primer and
adaptor
-o STRING, --outdir=STRING
output dir
- Total number of reads in the dataset.
- Total number of reads longer than 100 bp (cutoff can be changed through -l).
- Total number of reads with average quality scores greater than 20 (cutoff can be changed through -s).
- Total number of reads with primer sequences.
- Total number of reads with adaptor sequences.
- Total number of reads with both primer and adaptor sequences.
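The counts listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code in SeqProcessor.py: the function names, the dictionary inputs, and the assumption that "with primer" also includes reads that carry both sequences are all hypothetical. The default cutoffs of 100 and 20 are the ones stated in the list.

```python
def summarize_reads(seqs, quals, len_cutoff=100, qual_cutoff=20):
    """Count reads that pass the length and the average-quality cutoffs.

    seqs:  dict mapping read id -> sequence string
    quals: dict mapping read id -> list of integer quality scores
    """
    long_enough = 0
    good_quality = 0
    for read_id, seq in seqs.items():
        if len(seq) > len_cutoff:
            long_enough += 1
        scores = quals.get(read_id, [])
        # float() keeps the division exact under python2.7 as well.
        if scores and float(sum(scores)) / len(scores) > qual_cutoff:
            good_quality += 1
    return {
        "total": len(seqs),
        "long_enough": long_enough,
        "good_quality": good_quality,
    }


def summarize_hits(primer_ids, adaptor_ids):
    """Count reads with a primer hit, an adaptor hit, and both."""
    primer_ids, adaptor_ids = set(primer_ids), set(adaptor_ids)
    return {
        "with_primer": len(primer_ids),
        "with_adaptor": len(adaptor_ids),
        "with_both": len(primer_ids & adaptor_ids),
    }


# Toy data, invented for illustration.
seqs = {"r1": "A" * 150, "r2": "A" * 80, "r3": "A" * 101}
quals = {"r1": [30] * 150, "r2": [25] * 80, "r3": [10] * 101}
read_stats = summarize_reads(seqs, quals)
hit_stats = summarize_hits(["r1", "r2"], ["r1"])
```

Both comparisons are strict, matching the "greater than" wording used in the list.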
In addition, the script will generate a "Results" folder (folder name can be changed through -o) containing the following files:
- blast_out.m8: Blast output file in m8 format.
- filtered_seq.fna: Fasta file containing the reads that are longer than 100 bp and have an average quality score greater than 20, with primers and adaptors trimmed.
- alignment.tsv: Tab-delimited text file containing the read identifiers along with the start and end positions of the primer or adaptor sequences.
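To make the role of the start and end positions concrete, here is a hedged sketch of how matched regions could be removed from a read. It assumes 1-based, inclusive coordinates as BLAST reports them, and it keeps the longest stretch of the read that no hit covers. That strategy is an assumption made for illustration; the actual trimming rule in SeqProcessor.py, including how the -t error rate is applied, may differ. The name `trim_read` is hypothetical.

```python
def trim_read(seq, hits):
    """Return the longest stretch of seq not covered by any hit.

    hits: list of (start, end) pairs, 1-based and inclusive.
    """
    covered = [False] * len(seq)
    for start, end in hits:
        # Coordinates may arrive in descending order for reverse-strand matches.
        lo, hi = min(start, end), max(start, end)
        for i in range(max(lo, 1) - 1, min(hi, len(seq))):
            covered[i] = True

    best_start, best_len = 0, 0
    run_start = None
    # The trailing True acts as a sentinel that closes the final uncovered run.
    for i, flag in enumerate(covered + [True]):
        if not flag and run_start is None:
            run_start = i
        elif flag and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return seq[best_start:best_start + best_len]


# Toy read: 5 bases of "primer", 10 bases of insert, 4 bases of "adaptor".
toy_read = "CCCCC" + "ACGTACGTAC" + "GGGG"
trimmed = trim_read(toy_read, [(1, 5), (16, 19)])
```

In a full pipeline the same slice would also be applied to the quality scores so that sequence and quality stay in step, and the length and quality cutoffs would be checked on the trimmed read or the raw read depending on the intended behavior.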