/divide

Divide NGS data by barcode and primer

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

join_and_split.py

Helper for joining and splitting FASTQ files.

Requires Python 3.6 or above.

Usage

# Linux
## split
python3 join_and_split.py split -m fastq_file
## join
python3 join_and_split.py join -f forward.fastq -r reverse.fastq
# Windows
## split
python join_and_split.py split -m fastq_file
## join
python join_and_split.py join -f forward.fastq -r reverse.fastq

Use -t to set the linker text. By default, the program uses "JOINTEXT".

When splitting, "fastq_file" may be multiple files. Use "*.fastq" (including the quotation marks) to match all ".fastq" files in the current folder.
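
To illustrate what the linker text is for, here is a minimal sketch. It is not the actual code of join_and_split.py. It assumes that joining concatenates the forward read, the linker text and the reverse read, and that splitting cuts at the linker text. The names join_pair and split_joined are hypothetical.

```python
# Hypothetical sketch; the real join_and_split.py may work differently.
LINKER = "JOINTEXT"  # default linker text named in this README


def join_pair(forward, reverse, linker=LINKER):
    """Concatenate two reads with the linker text between them."""
    return forward + linker + reverse


def split_joined(joined, linker=LINKER):
    """Cut a joined read back into its forward and reverse parts."""
    forward, found, reverse = joined.partition(linker)
    if not found:
        raise ValueError("linker text not found in read")
    return forward, reverse


joined = join_pair("ACGTAC", "TTGCAA")
print(joined)
print(split_joined(joined))
```

The linker only works if it cannot occur in a real read, which is why a non-nucleotide string is used.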

divide.py

Divide NGS data by barcode and primer.

Prerequisite

  • Python 3.5 or above
  • Biopython
  • regex
  • vsearch (Optional)

To install Biopython and regex, run as administrator:

pip install biopython regex

Changelog

v4.6

Support ambiguous base.

v4.5

Extend vsearch options. Improve output.

v4.2

Integrate vsearch.

v4.0

Use regex instead of BLAST. Faster and easier.

v3.3

Parallel version, use BLAST.

v2.1

Single core version. Use BLAST.

v1.0

Deprecated.

Sequence structure

It can handle merged paired-end sequences like this:

barcode-adapter-primer-sequence-primer-adapter-barcode

Or just handle one direction:

barcode-adapter-primer-sequence

Sequences are divided by barcode according to the given barcode file. A sequence whose barcode differs by even a single base is dropped.
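
The exact-match rule can be pictured as a plain dictionary lookup. This is only a sketch under that assumption, not the code of divide.py. The function assign_sample and its arguments are hypothetical.

```python
# Hypothetical sketch of exact barcode matching; not taken from divide.py.
def assign_sample(read, samples_by_barcode, barcode_len):
    """Return the sample whose barcode starts the read, or None to drop it."""
    return samples_by_barcode.get(read[:barcode_len])


samples = {"ATACG": "S0001", "ATATA": "S0002"}
print(assign_sample("ATACGTTTTGGGG", samples, 5))  # barcode matches exactly
print(assign_sample("ATACCTTTTGGGG", samples, 5))  # one wrong base, dropped
```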

Adapter

Some protocols add an adapter sequence between the barcode and the primer. If yours does not, set the adapter length to zero with "--adapter 0". The default value is 14.
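
The effect of the adapter length can be sketched by slicing the 5' end of a read laid out as barcode-adapter-primer-sequence. This is an illustration only. The function split_head is hypothetical, and the default of 14 is taken from the text above.

```python
# Hypothetical sketch; divide.py may locate these parts differently.
def split_head(read, barcode_len, adapter_len=14):
    """Return (barcode, adapter, rest) cut from the 5' end of a read."""
    barcode = read[:barcode_len]
    adapter = read[barcode_len:barcode_len + adapter_len]
    rest = read[barcode_len + adapter_len:]
    return barcode, adapter, rest


# With an adapter of the default length
print(split_head("ATACG" + "C" * 14 + "ATCGATCGATCGA", 5))
# Without an adapter, as with "--adapter 0"
print(split_head("ATACG" + "ATCGATCGATCGA", 5, adapter_len=0))
```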

Barcode mode

Use "-m" to set the barcode mode. For example, "8*1" means a barcode of length 8 that repeats only once. The default is "5*2", i.e., a 5-base barcode repeated twice.

Note that the forward and reverse barcodes may have different sequences, but they SHOULD FOLLOW THE SAME MODE!
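
A mode string can be read as "length*repeat". The sketch below parses it and builds the full barcode. It assumes that "repeats twice" means the barcode unit appears twice in tandem in the read, which this README does not state explicitly. Both function names are hypothetical.

```python
# Hypothetical sketch; the tandem-repeat reading of the mode is an assumption.
def parse_mode(mode):
    """Parse a mode string such as "5*2" into (length, repeat)."""
    length, repeat = mode.split("*")
    return int(length), int(repeat)


def expand_barcode(barcode, mode="5*2"):
    """Repeat a barcode unit as the mode asks, after checking its length."""
    length, repeat = parse_mode(mode)
    if len(barcode) != length:
        raise ValueError("barcode length does not match the mode")
    return barcode * repeat


print(parse_mode("5*2"))
print(expand_barcode("ATACG"))
```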

Strict option

Use "-s" or "--strict" to enable strict mode. If set, the program checks whether the barcodes at the head and tail are equal and whether the barcode at the tail (3') is correct. Otherwise, it only checks the barcode at the head (5') of the sequence.
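
The difference between the two modes can be sketched as follows. This is not the code of divide.py. It assumes the expected tail barcode is passed in the orientation in which it appears at the 3' end of the read, since this README does not say whether the program reverse-complements it. The function barcode_ok is hypothetical.

```python
# Hypothetical sketch; tail-barcode orientation is an assumption (see above).
def barcode_ok(read, head_barcode, tail_barcode, strict=False):
    """Check the 5' barcode, and in strict mode the 3' barcode as well."""
    if not read.startswith(head_barcode):
        return False
    if strict and not read.endswith(tail_barcode):
        return False
    return True


read = "ATACG" + "TTTTGGGG" + "TATAC"
print(barcode_ok(read, "ATACG", "CCCCC"))               # head only
print(barcode_ok(read, "ATACG", "CCCCC", strict=True))  # head and tail
```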

Barcode file

Barcode file looks like this:

sample,barcode-f,barcode-r
S0001,ATACG,ATACG
S0002,ATATA,TATAC
S0003,ATACG
...

barcode-f is the barcode in the 5' direction and barcode-r is the barcode in the 3' direction. All sequences should be given in forward orientation.

If the forward and reverse barcodes are the same, you can omit the reverse barcode in the table.

To avoid potential errors, please do not use spaces in the sample info.

Note that fields are separated by an English (ASCII) comma "," rather than a Chinese (full-width) comma ",".
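
Reading such a barcode file with Python's csv module could look like the sketch below. It is an illustration, not the parser used by divide.py. The function read_barcodes is hypothetical, and it fills in a missing reverse barcode with the forward one, as described above.

```python
# Hypothetical sketch of a barcode file reader; not taken from divide.py.
import csv
import io


def read_barcodes(handle):
    """Return {sample: (barcode_f, barcode_r)} from a barcode CSV handle."""
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    barcodes = {}
    for row in reader:
        fields = [field.strip() for field in row]
        if not fields or not fields[0]:
            continue  # ignore blank lines
        sample, forward = fields[0], fields[1]
        # A missing reverse barcode means it equals the forward one.
        reverse = fields[2] if len(fields) > 2 and fields[2] else forward
        barcodes[sample] = (forward, reverse)
    return barcodes


example = io.StringIO(
    "sample,barcode-f,barcode-r\n"
    "S0001,ATACG,ATACG\n"
    "S0002,ATATA,TATAC\n"
    "S0003,ATACG\n"
)
print(read_barcodes(example))
```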

Primer file

Primer file looks like this:

gene,forward,reverse
rbcL,ATCGATCGATCGA,TACGTACGTACG
matK,AAAATTTTCCCC,GGGGTTACCAAAA
...

Or:

gene,sequence
rbcL-f,ATCGATCGATCGA
rbcL-r,TACGTACGTACG

You can prepare these two files in Microsoft Excel and save them in CSV format, or use any text editor you prefer.

Make sure you do not omit the first (header) line.
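
Both primer file layouts can be read into the same structure. The sketch below is an illustration, not the parser used by divide.py. The function read_primers is hypothetical, and naming three-column entries with "-f" and "-r" suffixes is an assumption borrowed from the two-column example above.

```python
# Hypothetical sketch of a primer file reader; not taken from divide.py.
import csv
import io


def read_primers(handle):
    """Return a list of (name, sequence) pairs from either primer layout."""
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    primers = []
    for row in reader:
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) == 3:
            # gene,forward,reverse layout
            gene, forward, reverse = fields
            primers.append((gene + "-f", forward))
            primers.append((gene + "-r", reverse))
        elif len(fields) == 2:
            # gene,sequence layout
            primers.append((fields[0], fields[1]))
    return primers


three_columns = io.StringIO(
    "gene,forward,reverse\n"
    "rbcL,ATCGATCGATCGA,TACGTACGTACG\n"
)
print(read_primers(three_columns))
```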

task.sh

If you use the PBS job submission system, you can use this script to submit the task. It covers the whole workflow, from combining the two read directions with flash and join_fastq.py to dividing the sequences.