Introduction ============ This repository contains the 'mcmapper' software tools for unbiased read-mapping. By unbiased, we mean that known SNPs are taken into account when mapping reads to the genome and when computing mappability (i.e. whether reads originating from a genomic location would map uniquely back to the genome). Other mapping tools do not typically take into account polymorphisms and introduce several biases: - Reads are less likely to map if they contain non-reference allele(s) especially if the reads also contain sequencing errors. This means that mapped reads will more often match the reference allele than the non-reference allele. - Reads with non-reference alleles may map to the incorrect genomic locations (with a similar sequence that better-matches the non-reference allele). - Reads that map uniquely to the reference genome may not map uniquely once polymorphisms are accounted for. - Mappability calculations do not typically consider that reads containing the non-reference allele may not map uniquely even when those containing the reference allele do. These biases are especially problematic for tests of allele-specific expression, which look for differences in the number of mapped reads that match each allele. These tests are very sensitive to mapping bias and many false positives are introduced if it is not accounted for. This bias cannot simply be overcome by masking SNPs and using a conventional read mapper. This is described in detail by Degner et al. 2009. To perform unbiased read-mapping using this software, perform the following steps (these are described in detail below): 1. Mark-up a reference genome with known SNPs 2. Build a seed index for the marked-up reference genome 3. Calculate the mappability of each position in the genome 4. Map reads to the genome 5. If desired, merge mapped read files, sort and discard duplicates 6. Filter mapped reads for mappability and indel overlap 0. Compiling / Installing ========================= libgenome: ---------- mcmapper depends on a software library called 'libgenome'. First obtain the latest version of the libgenome library from github (https://github.com/gmcvicker/genome) This library requires SCons to be built (I would like to strip out this dependency in the future). Download and install SCons if it is not already installed on your system (http://www.scons.org/). Now compile libgenome: cd genome/c scons This will create a shared library file in the directory genome/c/lib. On a UNIX system this will be named something like "libgenome.so", on MacOS X it may have an extension like ".dylib" instead of ".so". mcmapper -------- If you have not already done so, download the latest version of mcmapper from github: https://github.com/gmcvicker/mcmapper Edit mcmapper/c/Makefile so that the LIBGENOME line points to the directory containing the libgenome library compiled above. Compile the code: cd mcmapper/c make 1. Mark up a reference genome with SNPs ======================================= The provided C program 'mark_snps' can be used to markup a reference genome sequence so that it contains ambiguity characters representing known SNPs. For example positions with SNPs with alleles A and T will be changed to 'W'. The mark_snps program takes three positional arguments: ./mark_snps <input.fa.gz> <output.fa.gz> <snp_file.legend.gz> The first argument is a FASTA file, which contains the reference sequence for a single chromosome. The second argument is a path to where the new, marked-up reference sequence should be written (the program will complain if the file already exists). The third argument is a path to a file containing the SNPs for this chromosome. The SNPs are expected to be in the 'legend' format used by IMPUTE. Legend files containing 1000 Genomes SNPs can be downloaded from: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference The SNPs do not have to strictly follow legend format. Any file that contains the chromosome position (with first base of chromosome numbered 1) in the second column and alleles in the third and fourth columns will do. The first line in the SNP input file is assumed to be a header line and is ignored. WARNING: The markup and mapping programs currently ignore indels even though they are a major source of mapping bias. To account for this, we filter out all reads that overlap indels. This is described in section 6 below. We provide a wrapper shell script mcmapper/sh/mark_all_scripts.sh to run the markup program across all chromosomes. You should be able to modify this script to suit your needs (or to write a similar script from scratch). 2. Build a seed index ===================== The mapper requires a compressed binary seed index file, which provides the genomic positions of seed matches. The provided C program 'build_seed_index' is used to build the seed index file. This only needs to be done once for a seed of a given length (and for a given SNP-marked genome). I recommend using 12 or 13bp seeds. This program should be run on a machine with at least 16G of RAM. This program takes a minimum of 4 positional arguments: ./build_seed_index <seed_len> <chrom_info.txt> \ <output_seed_index> <ref_fasta1> <ref_fasta2> ... The first argument is the length of seed to use (in basepairs). I recommend 12 or 13. The second argument is the path to a 'chromInfo.txt' file containing the names and lengths of the chromosomes. This file can be downloaded from the UCSC genome browser. For hg19 the file is available here: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ The third argument is the path to the file where the new seed index table should be written. The remaining arguments are to fastq files containing all of the chromosomes that are specified in the chromInfo.txt file. Note that these files should have the SNPs marked as described in section 1 above. Exmaple: ./build_seed_index 13 chromInfo.txt seed_idx.13.gz \ ~/hg19/seq_with_snps/chr*.fa.gz 3. Calculate mappability ======================== The provided 'calc_mapping_uniqueness' program is used to calculate the mapping uniqueness of reads starting at every possible genomic position. It is not necessary to run this program before mapping reads. The output from this program is used to remove biases when performing post-mapping filtering, which is described in section 6 below. This program only needs to be run once for each read length that you plan to use. The program writes a single wiggle file per chromosome containing the following integer codes for reads starting at each position: 0 - does not map (or contains N, or overhangs chromosome end) 1 - maps uniquely 2 - maps to multiple locations This program takes into account SNPs (identified by ambiguity codes) and will attempt to map all combinations of alleles. Sometimes one allele will map unquely, but the other allele will map to multiple locations. In this case, the read is flagged as mapping to multiple locations. The program arguments are: 1 path to seed table created by build_seed_index 2 path to chromInfo.txt file giving chromosome names and lengths 3 length of reads in bp 4 number of allowed mismatches 5 output directory to write wiggle files to 6+ fasta files containing genome sequence with ambiguity codes representing SNPs While you can allow variable numbers of mismatches I recommend setting this to 0, because of the way the mapper works (described in section 4 below). Here is an example of running this program to calculate mappability of 36bp reads: ./calc_mapping_uniqueness seed_idx.13.gz chromInfo.txt 36 \ 0 mappability36/ ~/hg19/seq_with_snps/chr*.fa.gz 4. Map Reads ============ The 'map_fastq' programs maps all reads in a fastq (or qseq) file to the genome. map_fastq takes several named arguments and flags followed by one or more fastq files (containing the SNP-marked reference sequence): ./map_fastq OPTIONS <ref_fastq1.fa> [<ref_fasta2.fa> ...] optional flags: --fastq read input file is fastq format (default) --qseq read input file is qseq format --out_single write output to single file and keep reads in the same order as input file (default) --out_multi write mapped reads to separate chromosome files and write unmapped and multi-mapped reads to their own files required arguments: --chrom_info <path> file with chromosome names and lengths --read_len <integer> length of reads in input file --max_mismatch <integer> maximum number of read/genome mismatches --out_dir <path> directory to write output to --seed_index <path> path to seed index for reference genome --reads_file <path> path to fastq or qseq file with reads to map After mapping, each read is assiged the following mapping code: 0 - does not map 1 - maps uniquely 2 - maps to multiple locations If the --out_multi flag is provided, reads are written to separate output files, depending on their status. Unmapped reads are written to 'unmapped.txt.gz', multiply-mapping reads are written to 'multi_mapped.txt.gz' and uniquely mapping reads are written to a single file per chromosome (e.g. chr1.mapped.txt.gz for reads that map uniquely to chr1). Reads are mapped 'progressively'. First they are mapped allowing 0 mismatches to the reference sequence. Then the number of allowed mismatches is increased until one of the following conditions is met: - the read maps uniquely (mapping code 1) - the read maps to multiple locations (mapping code 2) - the number of allowed mismatches exceeds max_mismatch Known SNPs are taken into account when mapping: bases that match the reference or non-reference allele are not considered mismatches. I recommend allowing at least a single mismatch to avoid at least some of the mapping bias introduced by unknown SNPs (e.g. variants that are private to your sample). Unfortunately unknown indels can still result in occasionally biased mapping. Reads that contain Ns are not mapped. Increasing the number of allowed mismatches increases the number of non-overlapping seeds that must be used to map each read. For this reason, (seed_len * (n_allowed_mismatches+1)) must be less than or equal to read_len (the program will complain if it is not). Allowing more mismatches will slow down the mapping. If a 13bp seed table is used, map_fastq will require about 15-20Gb of memory to map reads to the human genome. The output files have one read per line with the following space-delimited columns. Reads are reverse complemented if they map to the reverse strand so that the sequence can always be directly compared to the forward strand of the genome. 1 - read sequence (reverse complemented if read mapped to - strand) 2 - chromosome name (or . if did not map) 3 - read start position 4 - strand (+, -) 5 - mapping code (0, 1, 2) for unmapped, unique, multiple 6 - number of mismatches I would like to add an option to write mapped reads in SAM or BAM, but I have not had time to implement this. Example: ./map_fastq --seed_index seed_idx.13.gz \ --chrom_info chromInfo.txt \ --out_dir mapped_reads \ --read_len 36 --max_mismatch 1 \ --fastq --reads_file H3K4me1_18508.fastq.gz \ ~/hg19/seq_with_snps/chr*.fa.gz 5. Merge lanes, sort reads and remove duplicates ================================================ Usually it is desirable to remove "duplicate" reads that map to the same position and strand of the genome. This is because protocols such as ChIP-seq or RNA-seq oftain contain a PCR step that introduces amplification bias (some reads are sequenced many times because they amplified well). Unfortunately existing tools for removing duplicate reads (e.g. samtools rmdup) usually keep the highest scoring read. This introduces a consistent reference bias, because reads that match the reference genome will be preferentially retained. We provide a short shell script 'mcmapper/sh/merge_sort_rmdup.sh' as an example of how to combine lanes, sort reads, and discard duplicates without reference bias. This script calls the provided python program rmdup.py, which discards duplicate reads randomly (without regard to mismatches). Note that this script is just an example and contains several hardcoded paths and commands that are specific to our cluster. We would like to provide a more flexible program in the future, but in the meantime this script will need to be modified by you if you intend to use it. 6. Filtering mapped reads ========================= Two final filtering steps must be performed to remove mapping bias: 1. discard mapped reads that overlap known indels (these were not accounted for by the mapper) 2. discard reads that map to positions where both alleles are not uniquely mappable (using the mappability wiggle files created in section 3 above) The scripts that we use for this purpose have several dependencies that make them difficult to use externally. We will provide more flexible scripts in the future. For now I think it would be easiest to write your own scripts to handle this filtering. You can use the python script 'filter_mappability_indels.py' as a starting point.