snpdiffrs finds differential single nucleotide polymorphisms between BAM files (of the same species).
Create an input.toml
containing, at minimum, the following
output_dir = 'tests/A_vs_B'
[samples]
A = ['path_to/A.bam']
B = ['path_to/B.bam']
and run using snpdiffrs input.toml
.
See Modes for details and other run modes..
The output files are tab-seperated-value files, with one row per SNP found.
Column | Description |
---|---|
chr | chromosome/BAM-reference |
pos | reference position (0-based) |
score | -1 * loge-likelihood according to internal model |
A_A | Count of adenine in 1st sample |
B_A | Count of adenine in 2nd sample |
A_C | Count of cytosine in 1st sample |
B_C | Count of cytosine in 2nd sample |
A_G | Count of guanine in 1st sample |
B_C | Count of guanine in 2nd sample |
A_T | Count of thymine in 2nd sample |
B_T | Count of thymine in 1st sample |
haplotypeA | Most likely haplotype (diploid) of 1st sample |
haplotypeB | Most likely haplotype (diploid) of 2nd sample |
Option | Type | Description |
---|---|---|
mode | enum | n_to_n, 1_to_n, quantify_homopolymers. Default: n_to_n. See modes |
output\_dir | string | Write output files here. Directory will be created if missing. Files will be overwritten. |
chromosomes | list | - restrict comparison to these chromosomes/BAM-references |
quality_threshold | float | Base quality threshold (PHRED) applied before counting. Default 15. |
filter_homo_polymer_threshold | int | Exclude reads that have a homopolymer of length >= filter_homo_polymer_threshold. Default: No such filtering. |
min_score | float | Minimum score (see output) to filter for. Default 50.0 |
ncores | int | How many CPU cores to use. Default: all of them. |
block_size | int | Size of coverage region examined at once. Trade-off between disk seeks, RAM usage. Setting this too high will impair parallel computation though. Default 50_000_000 |
Example:
mode = 'n_to_n'
output_dir = 'tests/A_vs_B'
[samples]
A = ['path_to/A.bam']
B = ['path_to/B.bam']
All samples will be compared vs each other, and you will receive one output file A_vs_B.tsv in your output directory.
Each pair will be present only once, with the sample names sorted in lexographical order.
Example:
mode = 'OneToN'
output_dir = 'tests/X_vs_all'
[query]
X = ['path_to/X.bam', 'path_to/X2.bam'] # multiple BAMs per sample will be aggregated
[references]
B = [...]
C = [...]
D = [...]
Compares one sample vs all others (e.g. to determine contamination). Output files named like X_vs_B.tsv, X_vs_C.tsv, etc.
Example:
mode = 'QuantifyHomopolymers'
output_dir = 'tests/quantification'
[samples]
A = ['A.bam'] # multiple BAMs per sample will be aggregated
B = ['B.bam'] # multiple BAMs per sample will be aggregated
...
Provide a tab-seperated-value of reads-with-a-(maximum)-homopolymer-length
histogram per sample named like A_homopolymer_histogram.tsv
.
Some sequencing runs apperantly create lot's of 'fake' homopolymers, which lead to massive false SNP calls. You can try to use the QuantifyHomopolymers mode to gain insight, I've had better luck simply comparing a run with a modest filter (e.g. filter homopolymers length >= 8) vs no homopolymer filtering, and looking at the ratio of called snps for sample pairs.