/Blast2Bam

Primary LanguageCMIT LicenseMIT

Blast2Bam

With FastQ or Fasta input, map sequences with NCBI/Blastn and output the results in SAM/BAM format.

Fastq2fasta converts single or paired end fastQ in a single fasta file for Blastn input.

Blast2Bam uses the XML results of Blastn, the reference and the fastQ or fasta file(s) to output a SAM file.

In case of paired end data, Blast results from the first and second fastQ/fasta are paired in the SAM output.

The file generated by Blast2Bam is compatible with SAMtools and pass the picard-tools ValidateSamFile test (when the secondary alignments are removed).

Requirements

  • gcc
  • libxml2
  • xsltproc
  • zlib

Compilation

$ make

Usage

Fastq2fasta

$ fastq2fasta [options] FastQ_1 [FastQ_2] > out.fasta

Blast2bam

$ blast2bam [options] blast.xml ref.fasta FastQ_1 [FastQ_2] > out.sam

Options

Fastq2fasta

OptionDescription
-o FILEFile output. Default: stdout.
-hHelp

Blast2bam

OptionDescription
-o FILEFile output. Default: stdout.
-pInterleaved. The input fastQ/fasta is an interleaved list of sequences forward and reverse.
-R STRRead group header line. Should look like this : `@RG\tID:foo\t[...]`.
-W INTMinimum alignment length. The alignment is displayed only if its length is greater than INT. Default: minimum length calculated based on read length.
-cShort version of the CIGAR string ('M' instead of '=' and 'X').
-zAdjusted position. The position of the alignment is adjusted to the position of the reference.
-hHelp

Example

Makefile

.SHELL = /bin/bash
BINDIR = ../../bin/
SRCDIR = ../../src/
FASTQ = test_1.fastq.gz test_2.fastq.gz
DB = hiv.fasta

results.sam: $(addprefix ${BINDIR}, blast2bam fastq2fasta) ${FASTQ} $(addsuffix .nin, ${DB})
	$(addprefix ${BINDIR}, fastq2fasta) ${FASTQ} | \
	blastn -db ${DB} -num_threads 4 -word_size 8 -outfmt 5 | \
	$(addprefix ${BINDIR}, blast2bam) -o $@ -R '@RG\tID:foo\tSM:sample' - ${DB} ${FASTQ}

$(addprefix ${BINDIR}, blast2bam fastq2fasta):
	(cd ${SRCDIR}; ${MAKE})

$(addsuffix .nin,${DB}): ${DB}
	makeblastdb -in $< -dbtype 'nucl'

FastQ

First in pair
(...)
@ERR656485.2 2 length=300
NTGGGCTAAAGGCCTTTTCCTCTATTACTTTTACCCATGCATTTAAAGTTCTAGGTGACATGGCCTGGTGTACCATTTGCCCTTGGAGATTTTGCACTATAGGATAATTTTGACTGACCTTCCCGTCAGCCCCTTTTTCCTGCTGTGTCTTTTGCTGACTTTGCTGACATTTGTTTTGTTCTTCCTCTATCTTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGATAGAGGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAGAAGAAAAGCAAGCAACACTAGG
+ERR656485.2 2 length=300
!8ACC@FGGGGGGGGFGGGGGGGGGGGGGGGGGGCFGGGGGGGGGGGFGGGGFCEFFGGGGGGGGGGGFGECFFGDGGGGGGGGGGEFGGGEGGGEFGGGGGGGGEFFFGGE?FFGGGGGFFGGGGGGFGGEGGGGGFGGGGGGGGGGFGGCEGDGGGGGGGGFGGGG9EGFFEGGGGGGGGGGGFFGF;DEGGGGFGGG>868EFGGFD?FGFGG55:FDFFF>:95??A=FAFF==B=46B47).1592::EEB?25-9@CE5>=/6>E02>96>(-.((.((,(((((,((((.,)(
(...)
Second in pair
(...)
@ERR656485.2 2 length=300
NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA
+ERR656485.2 2 length=300
!8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))
(...)

Blastn-XML

First in pair
(...)
<Iteration>
  <Iteration_iter-num>3</Iteration_iter-num>
  <Iteration_query-ID>Query_3</Iteration_query-ID>
  <Iteration_query-def>ERR656485.2</Iteration_query-def>
  <Iteration_query-len>300</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>gi|9629357|ref|NC_001802.1| Human immunodeficiency virus 1, complete genome</Hit_def>
  <Hit_accession>0</Hit_accession>
  <Hit_len>9181</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>148.852</Hsp_bit-score>
      <Hsp_score>80</Hsp_score>
      <Hsp_evalue>4.07002e-39</Hsp_evalue>
      <Hsp_query-from>2</Hsp_query-from>
      <Hsp_query-to>120</Hsp_query-to>
      <Hsp_hit-from>833</Hsp_hit-from>
      <Hsp_hit-to>715</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>-1</Hsp_hit-frame>
      <Hsp_identity>106</Hsp_identity>
      <Hsp_positive>106</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>119</Hsp_align-len>
      <Hsp_qseq>TGGGCTAAAGGCCTTTTCCTCTATTACTTTTACCCATGCATTTAAAGTTCTAGGTGACATGGCCTGGTGTACCATTTGCCCTTGGAGATTTTGCACTATAGGATAATTTTGACTGACCT</Hsp_qseq>
      <Hsp_hseq>TGGGCTGAAAGCCTTCTCTTCTACTACTTTTACCCATGCATTTAAAGTTCTAGGTGATATGGCCTGATGTACCATTTGCCCCTGGATGTTCTGCACTATAGGGTAATTTTGGCTGACCT</Hsp_hseq>
      <Hsp_midline>|||||| || ||||| || |||| ||||||||||||||||||||||||||||||||| |||||||| |||||||||||||| ||||  || ||||||||||| |||||||| |||||||</Hsp_midline>
    </Hsp>
(...)
Second in pair
(...)
<Iteration>
  <Iteration_iter-num>4</Iteration_iter-num>
  <Iteration_query-ID>Query_4</Iteration_query-ID>
  <Iteration_query-def>ERR656485.2</Iteration_query-def>
  <Iteration_query-len>300</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>gi|9629357|ref|NC_001802.1| Human immunodeficiency virus 1, complete genome</Hit_def>
  <Hit_accession>0</Hit_accession>
  <Hit_len>9181</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>152.546</Hsp_bit-score>
      <Hsp_score>82</Hsp_score>
      <Hsp_evalue>3.14632e-40</Hsp_evalue>
      <Hsp_query-from>74</Hsp_query-from>
      <Hsp_query-to>194</Hsp_query-to>
      <Hsp_hit-from>715</Hsp_hit-from>
      <Hsp_hit-to>835</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>108</Hsp_identity>
      <Hsp_positive>108</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>121</Hsp_align-len>
      <Hsp_qseq>AGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGA</Hsp_qseq>
      <Hsp_hseq>AGGTCAGCCAAAATTACCCTATAGTGCAGAACATCCAGGGGCAAATGGTACATCAGGCCATATCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTCAGCCCAGA</Hsp_hseq>
      <Hsp_midline>||||||| |||||||| ||||||||||| ||  |||| |||||||||||||| |||||||| ||||||||||||||||||||||||||||||||| |||| || ||||| || ||||||||</Hsp_midline>
    </Hsp>
(...)

SAM output

@SQ	SN:gi|9629357|ref|NC_001802.1|	LN:9181	
@RG	ID:foo	SM:sample
@PG	ID:Blast2Bam	PN:Blast2Bam	VN:0.1	CL:../../bin/blast2bam -o results.sam -R @RG\tID:foo\tSM:sample - hiv.fasta test_1.fastq.gz test_2.fastq.gz
(...)
ERR656485.2	83	gi|9629357|ref|NC_001802.1|	715	60	180S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X6=1S	=	715	-119	CCTAGTGTTGCTTGCTTTTCTTCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAN	(),.((((,(((((,((.((.-(>69>20E>6/=>5EC@9-52?BEE::2951.)74B64=B==FFAF=A??59:>FFFDF:55GGFGF?DFGGFE868>GGGFGGGGED;FGFFGGGGGGGGGGGEFFGE9GGGGFGGGGGGGGDGECGGFGGGGGGGGGGFGGGGGEGGFGGGGGGFFGGGGGFF?EGGFFFEGGGGGGGGFEGGGEGGGFEGGGGGGGGGGDGFFCEGFGGGGGGGGGGGFFECFGGGGFGGGGGGGGGGGFCGGGGGGGGGGGGGGGGGGFGGGGGGGGF@CCA8!	NM:i:13	RG:Z:foo	AS:i:80	XB:f:148.852	XE:Z:4.07e-39
ERR656485.2	163	gi|9629357|ref|NC_001802.1|	715	60	73S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X8=106S	=	715	119	NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA	!8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))	NM:i:13	RG:Z:foo	AS:i:82	XB:f:152.546	XE:Z:3.15e-40
(...)

IGV

igv01.png igv02.png

Authors

  • Aurélien Guy-Duché
  • Pierre Lindenbaum

Contribute

License

The project is licensed under the MIT2 license.