HAPSTR

Assembles 2-chromosome haplotypes from short-read sequencing data while tolerating sequencing errors. The program uses a scoring-based system: at each position it counts the 2-mers observed in the reads and assembles the haplotypes from the most frequent possibility. As a result, run time grows linearly with the total number of SNPs, assuming constant sequencing coverage. A fundamental assumption is that the correct haplotype is sequenced most frequently. After assembling the haplotype, Hapster offers to check the output against the true haplotype sequences, if available.
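
The 2-mer scoring idea can be illustrated with a minimal sketch. This is not Hapster's actual code: the read format (a start index plus a string of '0'/'1' alleles), the function name assemble_by_2mer_votes, and the grouping of 2-mers into "same" (00, 11) versus "different" (01, 10) votes are all assumptions made for illustration. The grouping relies on the two haplotypes being complementary.

```python
from collections import Counter

def assemble_by_2mer_votes(reads, num_snps):
    """Illustrative sketch only (hypothetical read format, not Hapster's input).

    reads: iterable of (start, alleles), where alleles is a string of '0'/'1'.
    Returns one haplotype and its complement.
    """
    # votes[i] counts the 2-mers observed across positions i and i + 1.
    votes = [Counter() for _ in range(num_snps - 1)]
    for start, alleles in reads:
        for offset in range(len(alleles) - 1):
            votes[start + offset][alleles[offset:offset + 2]] += 1

    haplotype = ['0']  # arbitrary label for the first position
    for i in range(num_snps - 1):
        # Assumption: with complementary haplotypes, 00/11 both say "adjacent
        # alleles match" and 01/10 both say "adjacent alleles differ".
        same = votes[i]['00'] + votes[i]['11']
        diff = votes[i]['01'] + votes[i]['10']
        prev = haplotype[-1]
        if same >= diff:
            haplotype.append(prev)
        else:
            haplotype.append('1' if prev == '0' else '0')

    hap = ''.join(haplotype)
    return hap, hap.translate(str.maketrans('01', '10'))
```

Each read is scanned once and each position is decided once, which is consistent with the linear scaling described above. A single erroneous read is outvoted as long as the correct 2-mers are seen more often.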

Baseline

Assembles 2-chromosome haplotypes from short-read sequencing data (see the data generated by datagen for the input format). This program uses a simple greedy algorithm: it assumes there are no sequencing errors and that every position is covered by at least one read of length 2 or more. For each position it is currently assembling, it uses the entirety of the first read it encounters that covers that position, so it runs in linear time with exactly N assignments, where N is the number of SNPs. The program interface is identical to Hapster's.
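
A rough sketch of this greedy strategy is shown here. It is an illustration under assumptions, not the baseline's source: the read format (start index plus '0'/'1' string), the requirement that reads arrive sorted by start position, and the name greedy_assemble are all hypothetical.

```python
def greedy_assemble(reads):
    """Illustrative greedy sketch (hypothetical format, error-free reads).

    reads: list of (start, alleles) sorted by start; alleles is a '0'/'1' string.
    Returns one haplotype and its complement.
    """
    flip = str.maketrans('01', '10')
    hap = ''
    for start, alleles in reads:
        end = start + len(alleles)
        if end <= len(hap):
            continue  # read covers nothing new, skip it
        if not hap:
            if start != 0:
                raise ValueError("no read covers position 0")
            hap = alleles  # the first read decides which chromosome we follow
            continue
        if start >= len(hap):
            raise ValueError("no read links position %d to the next" % (len(hap) - 1))
        # Use one overlapping position to decide which chromosome the read is from.
        overlap = len(hap) - 1
        if alleles[overlap - start] != hap[overlap]:
            alleles = alleles.translate(flip)
        # Take the whole remainder of the read; each position is assigned once.
        hap += alleles[len(hap) - start:]
    return hap, hap.translate(flip)
```

Because there is no voting, a single sequencing error in a chosen read is copied straight into the output, which is why this approach only serves as a baseline.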

Datagen

datagen produces 2 complementary haplotype sequences of a given length, along with simulated sequencing data for these haplotypes.

Instructions for use:

Run as "python datagen.py output" in a terminal

-"output" = name of the output file for the generated data (".txt" is appended automatically). If no argument is passed, the default is "output".

The program will prompt the user to enter the number of SNPs and error rate (between 0 and 1).
When completed, "output.txt" will contain the sequencing data and "output_haplotypes.txt" will contain the haplotype sequences.

The number of reads per SNP is normally distributed with a mean of 10, a standard deviation of 2, and a minimum of 7. The length of each read is normally distributed with a mean of 5, a standard deviation of 0.05, and a minimum of 5. These values are somewhat arbitrary and can be changed within datagen.py. However, at least one read of length 2 at each position is required for Hapster to successfully assemble the haplotype.
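
To make these parameters concrete, here is a small sketch of how such values could be drawn. It is not taken from datagen.py: the function name sample_read_plan, the rounding to integers, and enforcing the minimum by clamping (rather than, say, redrawing) are assumptions.

```python
import random

def sample_read_plan(num_snps, seed=None):
    """Illustrative sketch: draw read counts and read lengths per SNP.

    Assumes values are rounded to integers and clamped to the stated minimum;
    datagen.py may enforce the minimum differently.
    """
    rng = random.Random(seed)
    plan = []
    for pos in range(num_snps):
        # Reads per SNP: mean 10, standard deviation 2, minimum 7.
        n_reads = max(7, round(rng.gauss(10, 2)))
        # Read length: mean 5, standard deviation 0.05, minimum 5.
        lengths = [max(5, round(rng.gauss(5, 0.05))) for _ in range(n_reads)]
        plan.append((pos, lengths))
    return plan
```

Under these assumptions the read length is 5 in practice, since the standard deviation is tiny relative to the rounding step and the minimum equals the mean.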

Misc

errortest.py approximates the sequencing error rate from a read matrix, given the correct haplotypes. It is intended for testing.
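
One way such an estimate could be computed is sketched here. This is not errortest.py itself: the read format (start index plus '0'/'1' string), the function name estimate_error_rate, and the rule of attributing each read to whichever haplotype it matches better are assumptions.

```python
def estimate_error_rate(reads, hap_a, hap_b):
    """Illustrative sketch: fraction of read bases that disagree with the
    closer of the two true haplotypes (hypothetical read format)."""
    mismatches = 0
    total = 0
    for start, alleles in reads:
        def distance(hap):
            # Count positions where the read disagrees with this haplotype.
            segment = hap[start:start + len(alleles)]
            return sum(a != b for a, b in zip(alleles, segment))
        # Assumption: the read came from whichever haplotype it matches better.
        mismatches += min(distance(hap_a), distance(hap_b))
        total += len(alleles)
    return mismatches / total if total else 0.0
```

Because each read is assigned to its nearest haplotype, a read with errors at more than half of its positions is attributed to the wrong chromosome, so this kind of estimate tends to run low at high error rates.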

tester.py runs hapster against the baseline for a range of haplotype sizes and sequencing error rates, which can be specified within the code.

TODO

Rewrite hapster to use absolute file positions rather than reading lines. Time complexity is not linear with large files (>25GB) due to the nature of Python's readline function. (Done, it's linear now!)
Determine the error rate threshold for hapache. (Done, it's roughly 10%)
Compare hapache to some baseline. Considering HASH (Bansal et al., 2008) or HapCut (Bansal and Bafna, 2008). (Done, ended up comparing to a greedy algorithm)
Write a haplotype assembler for multiple chromosomes and associated data generator

brandonjew/hapster
