Tip and tricks for BAM files

Usefull tools

Samtools organisation and repositories

File format specification
Samtools github page
Samtools webpage
Samtools man page

Compilation (from here):

git clone --branch=develop git://github.com/samtools/htslib.git
git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/samtools.git
cd bcftools; make
cd ../samtools; make
cd ../htslib; make

Other tools

biobambam2 github
Picard with its github page
R package rbamtools
bedtools documentation and github page

Tip and tricks

Check if BAM file is sorted

samtools view -H test.bam | grep @HD

Will show SO:coordinate if sorted.

Sort a BAM file

samtools sort -o test_sorted.bam -T tmp test.bam

Extract run ID, flow cell ID and Lane number

This only works for recent Illumina BAM files but can easily be adapted for other types and for FASTQ files. We extract the first column from the first read of the BAM that looks like HISEQ:292:C5GGUACXX:6:1101:16516:24562 and take fields 2, 3 and 4 separated by ::

samtools view test.bam | head -1 | cut -f1 | cut -f2-4 -d':' | tr ':' '\t'

Extract sample name

samtools view -H test.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq

To have the sample names of all BAM files in a folder with their file names:

for f in *.bam; do echo -ne "$f\t" ; samtools view -H $f | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq; done

Change sample name

For all read groups in a BAM file

samtools view -H test.bam  | sed "s/SM:[^\t]*/SM:TEST_SAMPLE_NAME/g" | samtools reheader - test.bam > test_SM.bam

Simple variant calling using freebayes

freebayes -f ucsc.hg19.fasta --min-alternate-count 5 --no-complex --min-mapping-quality 20 --min-base-quality 20 \
  --min-coverage 20 test.bam | vcffilter -f "QUAL > 20" |  vcfbreakmulti | vt normalize - -q -r ucsc.hg19.fasta > test.vcf

Add a new ID-specified read group in the bam header

RGline=$(samtools view -H test.bam | grep '@RG' | head -1 | sed "s/ID:[^\t]*/ID:NEW_ID/g")
line_number=$(samtools view -H test.bam | sed -n '/@RG/=' - | head -1)
samtools view -H test.bam | sed "${line_number}"'i\'"${RGline}" | samtools reheader - test.bam > test_RGadded.bam
unset RGline; unset line_number

Calculate coverage for each position of a bed

bedtools coverage -d -a my_bed.bed -b my_bam.bam

Find where reads map in a BAM file

This will only report non-zero coverage and create contiguous regions with similar coverage (see bedtools manual). Here for example we only keep regions with at least 10 reads:

$ bedtools genomecov -bg -ibam test.bam | awk '$4>=10'
chr1	154566162	154566163	14
chr1	154566163	154566164	15
chr1	154566164	154566167	18
chr1	154566167	154566171	19

It can also be useful to group these regions and to report the total number of reads in each region. The first step is to merge contiguous regions identified by the previous command with bedtools merge. Here we also merge regions less than 100bp from each other:

$ bedtools genomecov -bg -ibam test.bam | awk '$4>=10' | bedtools merge -d 100 -i stdin
chr1	154566162	154566294
chr2	45189642	45189787

Finally we can now ask bedtools to count the number of reads in each of these regions using coverage, and the full command is:

$ bedtools genomecov -bg -ibam test.bam | awk '$4>=10' | bedtools merge -d 100 -i stdin | bedtools coverage -a stdin -b test.bam
chr1	154566162	154566294	21	132	132	1.0000000

Here for example there are 21 reads in region chr1:154566162-154566294. These 21 reads cover 132bp and the region itself is 132bp long, so 100% of the region is covered.

Convert BAM to FASTQ

Usefull links explaining why it's not so simple (for paired-end sequencing at least): http://crazyhottommy.blogspot.fr/2015/10/convert-bam-to-fastq-and-remap-it-with.html

Warning when reading old texts about this: htscmd bamshuf has been successively renamed samtools bamshuf and now samtools collate (since samtools v1.3). Similarly htscmd bam2fq has been successively renamed samtools bam2fq and now simply samtools fastq.

This commands allows to do it without intermediate files, including the final zip:

samtools collate -uOn 128 in.bam tmp-prefix | samtools fastq -s se.fq.gz - | gzip > in_interleaved_reads.fq.gz

One might often do this to realign afterward the FASQ file. In this case it's even possibe to avoid creating the intermediate FASTQ and to go directly from the BAM to the realigned BAM in a single command:

samtools collate -uOn 128 in.bam tmp-prefix | samtools fastq -s se.fq.gz - | bwa mem -p ref.fa -

Actually one can even directly mark duplicates and sort the final BAM in a single command like speedseq is doing using samblaster and sambamba.

IARCbioinfo/BAM-tricks