- Table of Contents
- Introduction
- Installing SAMtools
- Basic usage
- Viewing
- Converting a SAM file to a BAM file
- Converting a BAM file to a CRAM file
- Sorting a BAM file
- Converting SAM directly to a sorted BAM file
- Creating a BAM index file
- Converting a BAM file to a SAM file
- Filtering out unmapped reads in BAM files
- Extracting SAM entries mapping to a specific loci
- Extracting only the first read from paired end BAM files
- Stats
- Interpreting the BAM flags
- samtools calmd/fillmd
- Creating fastq files from a BAM file
- Random subsampling of BAM file
- Fastest way to count number of reads
- Obtaining genomic sequence
Created by gh-md-toc
One of my most popular pages on my website is my SAMtools page, which I created back in 2011. Unfortunately I can no longer edit that page for technical reasons, so I have migrated the information here. My examples do not include the -@
argument, which allows the extremely useful feature of multi-threading. This is a very useful feature given that BAM files can get rather huge these days. For the latest information, please refer to the release notes.
I have created an example SAM file to demonstrate the commands. The steps to create aln.sam
are inside the run.sh
script. The reads in the SAM file are created from a randomly generated reference sequence; typing make
will run all the steps used to create the file.
git clone https://github.com/davetang/learning_bam_file.git
cd learning_bam_file/
make
SAMtools provides various tools for manipulating alignments in the SAM/BAM format. The SAM (Sequence Alignment/Map) format (BAM is just the binary form of SAM) is currently the de facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput sequencing data, at some point you will probably have to deal with SAM/BAM files, so familiarise yourself with them! All of the examples below, use the aln.sam
example SAM file that I created.
Just copy and paste the code below.
wget https://github.com/samtools/htslib/releases/download/1.7/htslib-1.7.tar.bz2
tar -xjf htslib-1.7.tar.bz2
cd htslib-1.7
make
make prefix=. install
cd ..
wget https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2
tar -xjf samtools-1.7.tar.bz2
cd samtools-1.7
make
make prefix=. install
cd ..
If you run SAMtools on the terminal without any parameters, all the available utilities are listed:
samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.7 (using htslib 1.7)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
Use bioSyntax to prettify your output.
samtools view aln.bam | sam-less
A BAM file is just a SAM file but stored in binary; you should always convert your SAM files into BAM to save storage space and BAM files are faster to manipulate.
To get started, view the first couple of lines of your SAM file by typing on the terminal:
head aln.sam
@SQ SN:1000000 LN:1000000
@PG ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa/bwa mem sequence/ref.fa sequence/l100_n10000_d300_31_1.fq sequence/l100_n10000_d300_31_2.fq
1:165617 99 1000000 165617 60 100M = 165917 400 TGCAGTGGTATCGGATCAGCCTAGATGCCATAGCTGAGCGCCAAATTTCCGGATTTTCCCCGTGTAGTCAATGGAGCTGTTACTTTAAGCCGTGAATGTG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
1:165617 147 1000000 165917 60 100M = 165617 -400 AATTCGATATGCCGGTCATCGTGTGTCTATGATACTCCTTAGGCATCCCTTTAACTACGATACTTTAAGAGGTGCGAAAAGTATTCTATACGGCAGCGTA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
2:591911 99 1000000 591911 60 100M = 592211 400 GATCAAGCTGGGCATGGGTTCGGTGACGCGTAAAAAAATTTTTTTCTGAGGACCACTGAGAAGATGGTTACGTCTAGGATCTAAGACCTAGTGTCAACTC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
2:591911 147 1000000 592211 60 100M = 591911 -400 GCATGACACTGGATAGTGCGATTAGATAGCGGCTCGGGAGTACGTCACTGAAAGTCCAGTGCGAGAGCCAACCCGGAAACTCTACATGCGCATGTAGAAC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
3:987691 99 1000000 987691 60 100M = 987991 400 CTCGGGACTATCTGTCAAACACTAGGGCTTGGCATAACATCTCTGAATAATAGCCAACGCGCGAGGTGTACGGGAAAAAGGAGGACCACCGCGTTATCAC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
3:987691 147 1000000 987991 60 100M = 987691 -400 ACCGACTCGCAATTATCTCGTATCCGGGAAACTGTATAGCCGGGGGAAACTCCGATACGGACCGGCATTGGTACCAAGCGTCGAGTAGATTACCACCGAC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
4:301460 99 1000000 301460 60 100M = 301760 400 GGACGTATACCTACTCGCCCAATTCGATCAGTGGTATCTAGTTAAGAAATAGTCTTCCTCAATTTGACTCGCCTCAACGGTTGTCTATCTGAGCTGGAAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
4:301460 147 1000000 301760 60 100M = 301460 -400 GTGTAGACAGGGCCATACTACTAACATCTCACAGATTAGGTTCATGTCCTCTCTAGCTCGCCAGCGCGGCTACATTTGGACTTGATACCGTTACAACGGT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
The lines starting with the "@" sign contains the header information. The @SQ tag is the reference sequence dictionary; SN refers to the reference sequence name and LN refers to the reference sequence length. If you don't see lines starting with the "@" sign, the header information is most likely missing. If the header is absent from the SAM file use the command below, where ref.fa is the reference fasta file used to map the reads:
samtools view -bT sequence/ref.fa aln.sam > aln.bam
If the header information is available:
samtools view -bS aln.sam > aln.bam
# in the newer version of SAMtools the input format is autodetected
# hence we don't need the -S parameter any more
samtools view -b aln.sam > aln.bam
The CRAM format saves you even more disk space; disk usage of SAM = 5.5M, unordered BAM = 978K, ordered BAM = 746K, and CRAM = 91K
samtools view -T sequence/ref.fa -C -o aln.cram aln.bam
ls -hl aln.sam
-rw-r--r-- 1 dtang dtang 5.5M Feb 9 16:55 aln.sam
# unsorted BAM file
ls -hl aln.bam
-rw-r--r-- 1 dtang dtang 978K Feb 10 16:49 aln.bam
# sorted BAM file
ls -hl aln.bam
-rw-r--r-- 1 dtang dtang 746K Feb 10 16:50 aln.bam
# CRAM from sorted BAM file
ls -hl aln.cram
-rw-r--r-- 1 dtang dtang 91K Feb 10 16:52 aln.cram
I have an old blog post on the CRAM format.
Always sort your BAM files; many downstream programs only take sorted BAM files.
samtools sort aln.bam -o aln.bam
Like many Unix tools, SAMtools is able to read directly from a stream i.e. stdout.
samtools view -bS aln.sam | samtools sort - -o aln.bam
In SAMtools version 1.3 or newer, you can sort a SAM file directly.
samtools sort -o aln.bam aln.sam
In addition, you should use use additional threads, if they are available, to speed up sorting.
samtools sort -@ 8 -o aln.bam aln.sam
A BAM index file is usually needed when visualising a BAM file.
samtools index aln.bam
Note: remember to use -h to ensure the converted SAM file contains the header information. Generally, I recommend storing only sorted BAM files as they use much less disk space and are faster to process.
samtools view -h aln.bam > aln2.sam
samtools view -h -F 4 aln.bam > aln_only_mapped.sam
# output back to BAM
samtools view -h -F 4 -b aln.bam > aln_only_mapped.bam
# in the newer version of SAMtools there is the flags subcommand
# which will tell you what the flags are
samtools flags 4
0x4 4 UNMAP
If we want all reads mapping within a specific genomic region, we can use samtools view
and the ref:start-end
syntax. The name of the reference sequence in my example SAM file is 1000000. You can use just the ref
to extract an entire reference sequence such as a chromosome (example not shown here).
# index the bam file first
samtools index aln.bam
# reads mapping in the region 1000 to 1300
samtools view aln.bam 1000000:1000-1300
6144:733 147 1000000 1033 60 100M = 733 -400 CCCTCCGGGGAGGGGATCTATTAGGAAAGATTCGAGTCGGTACGTTGTATGGAACGGTTAGCACCGCCATATATCAGATTCAAAATTATCTAGGGTTTCA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
7897:1206 99 1000000 1206 60 100M = 1506 400 TTGCCGTGTACTGGTTGAGCCATACGCAAAATTGGGATCACATTTGTATTGACGCAGGAAAACTGATGCCCGTATCTCCAGAGGTACCAGAGCGACCTGG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
9588:1278 99 1000000 1278 60 100M = 1578 400 TATCTCCAGAGGTACCAGAGCGACCTGGACTTCAGCCAGGGAGATAGAGTATCACACTGAAGGACGGTAGCCGATAGATGAGCTGGACCTTGGATCTACT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
# all reads mapping on 1000000 between 1000 to 1300 saved to another BAM file
samtools view -b aln.bam 1000000:1000-1300 > aln_1000_1300.bam
You can also use a BED file, with several entries, to extract reads of interest.
cat my.bed
1000000 1000 1300
1000000 2000 2300
samtools view -L my.bed aln.bam
6144:733 147 1000000 1033 60 100M = 733 -400 CCCTCCGGGGAGGGGATCTATTAGGAAAGATTCGAGTCGGTACGTTGTATGGAACGGTTAGCACCGCCATATATCAGATTCAAAATTATCTAGGGTTTCA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
7897:1206 99 1000000 1206 60 100M = 1506 400 TTGCCGTGTACTGGTTGAGCCATACGCAAAATTGGGATCACATTTGTATTGACGCAGGAAAACTGATGCCCGTATCTCCAGAGGTACCAGAGCGACCTGG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
9588:1278 99 1000000 1278 60 100M = 1578 400 TATCTCCAGAGGTACCAGAGCGACCTGGACTTCAGCCAGGGAGATAGAGTATCACACTGAAGGACGGTAGCCGATAGATGAGCTGGACCTTGGATCTACT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
10:1630 147 1000000 1930 60 100M = 1630 -400 CCACCCACTACAGTAAATGAGAGAAAACGGATTAAGGTTCTAACCCTTAGCGATGACCGTTCTCACGGGTGTCTAGATACCGAGTAGAACCAACCAGCAC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
9137:1657 147 1000000 1957 60 100M = 1657 -400 CGGATTAAGGTTCTAACCCTTAGCGATGACCGTTCTCACGGGTGTCTAGATACCGAGTAGAACCAACCAGCACCCAGCATTAAACCACAGAAGCACCGTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
5893:1667 147 1000000 1967 60 100M = 1667 -400 TTCTAACCCTTAGCGATGACCGTTCTCACGGGTGTCTAGATACCGAGTAGAACCAACCAGCACCCAGCATTAAACCACAGAAGCACCGTTGGTTTCTTTA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
3617:1997 99 1000000 1997 60 100M = 2297 400 GGTGTCTAGATACCGAGTAGAACCAACCAGCACCCAGCATTAAACCACAGAAGCACCGTTGGTTTCTTTAATAGCTTCAACGCCGGCCATATCAATAAAA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
5736:1773 147 1000000 2073 60 100M = 1773 -400 TCAACGCCGGCCATATCAATAAAAGGCTGTCGCCACATCCGTGGGTGCAAACTGAGCAGCCATGAACCTGGAAACGTTGTAGCCATGAAGATAATGCATA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
3123:2230 99 1000000 2230 60 100M = 2530 400 CGGCACGTGCTGTCAGTATATAGTGTCGCTCATCAGGGAGGTACACCACGCGGTGGAGCATCCACGCTTTTCCCCATCTTCTATTACCTCGGCGGGGAAA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
4758:2263 99 1000000 2263 60 100M = 2563 400 CAGGGAGGTACACCACGCGGTGGAGCATCCACGCTTTTCCCCATCTTCTATTACCTCGGCGGGGAAACAGGTAGATATGGGGGTTGGCTTGTGCAAGATA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
3617:1997 147 1000000 2297 60 100M = 1997 -400 TTTTCCCCATCTTCTATTACCTCGGCGGGGAAACAGGTAGATATGGGGGTTGGCTTGTGCAAGATACAATTCGATAGTTGCGGGGGCTTAGATCGGCGTG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
Sometimes you only want the first pair of a mate.
samtools view -h -f 0x0040 aln.bam > aln_first_pair.sam
0x0040 is hexadecimal for 64 (i.e. 16 * 4), which is binary for 1000000, corresponding to the read in the first read pair.
For simple statistics use samtools flagstat
.
samtools flagstat aln.bam
20000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
20000 + 0 mapped (100.00% : N/A)
20000 + 0 paired in sequencing
10000 + 0 read1
10000 + 0 read2
20000 + 0 properly paired (100.00% : N/A)
20000 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
For more stats, use samtools stats
.
samtools stats aln.bam | grep ^SN
SN raw total sequences: 20000
SN filtered sequences: 0
SN sequences: 20000
SN is sorted: 1
SN 1st fragments: 10000
SN last fragments: 10000
SN reads mapped: 20000
SN reads mapped and paired: 20000 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 20000 # proper-pair bit set
SN reads paired: 20000 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 0 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN total length: 2000000 # ignores clipping
SN bases mapped: 2000000 # ignores clipping
SN bases mapped (cigar): 2000000 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0 # from NM fields
SN error rate: 0.000000e+00 # mismatches / bases mapped (cigar)
SN average length: 100
SN maximum length: 100
SN average quality: 41.0
SN insert size average: 400.0
SN insert size standard deviation: 0.0
SN inward oriented pairs: 10000
SN outward oriented pairs: 0
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
In addition, you can create plots from the samtools stats
output using plot-bamstats
; you need to have gnuplot
installed though. plot-bamstats
is located in the misc
directory where you downloaded samtools
.
samtools stats aln.bam > aln.stat
~/src/samtools-1.6/misc/plot-bamstats -p out/ aln.stat
ls -1 out
acgt-cycles.gp
acgt-cycles.png
coverage.gp
coverage.png
gc-content.gp
gc-content.png
gc-depth.gp
gc-depth.png
index.html
insert-size.gp
insert-size.png
quals-hm.gp
quals-hm.png
quals.gp
quals.png
quals2.gp
quals2.png
quals3.gp
quals3.png
The second column in a SAM/BAM file is the flag column. They may seem confusing at first but the encoding allows details about a read to be stored by just using a few digits. The trick is to convert the numerical digit into binary, and then use the table to interpret the binary numbers, where 1 = true and 0 = false. I wrote a blog post on BAM flags: http://davetang.org/muse/2014/03/06/understanding-bam-flags/, which also includes a Perl script for interpreting BAM flags. There is also the flags
subcommand.
samtools flags
About: Convert between textual and numeric flag representation
Usage: samtools flags INT|STR[,...]
Flags:
0x1 PAIRED .. paired-end (or multiple-segment) sequencing technology
0x2 PROPER_PAIR .. each segment properly aligned according to the aligner
0x4 UNMAP .. segment unmapped
0x8 MUNMAP .. next segment in the template unmapped
0x10 REVERSE .. SEQ is reverse complemented
0x20 MREVERSE .. SEQ of the next segment in the template is reversed
0x40 READ1 .. the first segment in the template
0x80 READ2 .. the last segment in the template
0x100 SECONDARY .. secondary alignment
0x200 QCFAIL .. not passing quality controls
0x400 DUP .. PCR or optical duplicate
0x800 SUPPLEMENTARY .. supplementary alignment
samtools flags 16
0x10 16 REVERSE
The calmd or fillmd tool is useful for visualising mismatches and insertions in an alignment of a read to a reference genome. For example:
#the -e changes identical bases between the read and reference into a bunch of =
samtools view -b aln.bam | samtools fillmd -e - sequence/ref.fa | head
@HD VN:1.3 SO:coordinate
@SQ SN:1000000 LN:1000000
@PG ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa/bwa mem sequence/ref.fa sequence/l100_n10000_d300_31_1.fq sequence/l100_n10000_d300_31_2.fq
6125:151 99 1000000 151 60 100M = 451 400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
9336:174 99 1000000 174 60 100M = 474 400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
1355:335 99 1000000 335 60 100M = 635 400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
7456:415 99 1000000 415 60 100M = 715 400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
6125:151 147 1000000 451 60 100M = 151 -400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
9336:174 147 1000000 474 60 100M = 174 -400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
1355:335 147 1000000 635 60 100M = 335 -400 ==================================================================================================== JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
The reads I have generated have no mismatches, so they are all displayed as equal signs.
Use BEDTools.
bedtools bamtofastq -i aln.bam -fq aln.fq
There is the fastq
subcommand in the newer version of SAMtools.
samtools fastq -1 aln_1.fq -2 aln_2.fq aln.bam
The SAMtools view -s parameter allows you to randomly sample lines of a BAM file
samtools view -s 0.5 -b aln.bam > aln_random.bam
Note that this will subsample half of the reads that mapped.
Use samtools idxstats
to print stats on a BAM file; this requires an index file which is created by running samtools index
. The reference sequence name of the example SAM file in this repository is (confusingly) called 1000000. More standard reference names are 'chr1', 'chr2', etc.
# convert to BAM and sort
samtools view -bS aln.sam | samtools sort - > aln.bam
# index
samtools index aln.bam
# output of idxstats is:
# ref name, sequence length of ref, no. mapped reads, and no. unmapped reads
samtools idxstats aln.bam
1000000 1000000 20000 0
* 0 0 0
# number of reads = mapped + unmapped
samtools idxstats aln.bam | awk '{s+=$3+$4} END {print s}'
20000
# number of mapped reads = 3rd column
samtools idxstats aln.bam | awk '{s+=$3} END {print s}'
20000
# index fasta file
samtools faidx sequence/ref.fa
# obtain sequence
samtools faidx sequence/ref.fa 1000000:10-11
>1000000:10-11
AG
# note the 1-based start
samtools faidx sequence/ref.fa 1000000:0-2
>1000000:0-2
AG
samtools faidx sequence/ref.fa 1000000:1-2
>1000000:1-2
AG