requirements:
- Python3.6: Python3 is OK, but I'd like to use fstring in Python3.6, if you're using Python3 but under 3.6, feel free to replace the
fstring
statement withformat
statement. - path.py
- biopython: BioPython is for fastq files processing.
- pysam: standard library for bam file processing.
- fastinterval: genomic interval processing.
it only works with sorted bam file, with UMI at the end of reads id:
M03074:71:000000000-B49DJ:1:2103:19133:13331:TTCGCACGA
M03074:71:000000000-B49DJ:1:2105:4089:17175:TTCGCACGA
for UMI pre-processing, here are some choices:
if your data is from illumina platform, just use bcl2fastq
.
Before you doing this, you must make sure the UMI cycles is in either read1 or read2 cycles, you can change Reads section in RunInfo.xml:
for example, if your UMI is in INDEX2, with 10 bases, your RunInfo.xml will looks like:
<Read Number="1" NumCycles="150" IsIndexedRead="N" />
<Read Number="2" NumCycles="8" IsIndexedRead="Y" />
<Read Number="3" NumCycles="10" IsIndexedRead="Y" />
<Read Number="4" NumCycles="150" IsIndexedRead="N" />
since bcl2fastq can not handle the UMI in the index region, you must combine the UMI cycles to reads2 cycles, like:
<Read Number="1" NumCycles="150" IsIndexedRead="N" />
<Read Number="2" NumCycles="8" IsIndexedRead="Y" />
<Read Number="4" NumCycles="160" IsIndexedRead="N" />
now, bcl2fastq will take the UMI cycles for Reads2.
set settings section in your SampleSheet.csv:
[Settings]
Read1StartFromCycle,1
Read1EndWithCycle,150
Read2StartFromCycle,1
Read2EndWithCycle,160
Read2UMIStartFromCycle,1
Read2UMILength,10
TrimUMI,1
make sure your csv file with same number of columns.
fastp is a fast all-in-one tool for preprocessing FastQ files.
It can extract UMI too.
using bwa
or other aligner, produce a sorted bam file, for example:
bwa mem -t 8 -M hg19.fasta read1.fastq read2.fastq | \
samtools view -@ 8 -S -b | \
samtools sort -@ 8 -o sample_name.sorted.bam
from consensus_maker import ConsensusWorker
worker = ConsensusWorker('sample_name.sorted.bam', 'consensus.1.fastq', 'consensus.2.fastq',
bed_file='target.bed')
worker.output_pe_reads()
TODO:
- make a simple script;
- UMI analysis and report;
- finish testing;
- support for
a-b
family andb-a
family.
welcome to pull a request.