fastahack --- *fast* FASTA file indexing, subsequence and sequence extraction

Author: Erik Garrison <erik.garrison@bc.edu>, Marth Lab, Boston College
Date:   May 7, 2010


Overview:

fastahack is a small application for indexing and extracting sequences and
subsequences from FASTA files. The included Fasta.cpp library provides a FASTA
reader and indexer that can be embedded into applications which would benefit
from directly reading subsequences from FASTA files. The library automatically
handles index file generation and use.


Features:

 - FASTA index (.fai) generation for FASTA files
 - Sequence extraction
 - Subsequence extraction
 - Sequence statistics (TODO: currently only entropy is provided)

Sequence and subsequence extraction use fseek64 to provide fastest-possible
extraction without RAM-intensive file loading operations. This makes fastahack
a useful tool for bioinformaticists who need to quickly extract many
subsequences from a reference FASTA sequence.


Notes:

The index files generated by this system should be numerically equivalent to
those generated by samtools (http://samtools.sourceforge.net/). However, while
samtools truncates sequence names in the index file, fastahack provides them
completely.

To simplify use, sequences can be addressed by first whitespace-separated
field; e.g. "8 SN(Homo sapiens) GA(HG18) URI(NC_000008.9)" can be addressed
simply as "8", provided "8" is a unique first-field name in the FASTA file.
Thus, to extract 20bp starting at position 323202 in chromosome 8 from the
human reference:

    % fastahack -r 8:323202..20 h.sapiens.fasta
    ACATTGTAATAGATCTCAGA

Usage information is provided by running fastahack with no arguments:

    % fastahack
    usage: fastahack [options] <fasta reference>

    options:
        -i, --index          generate fasta index <fasta reference>.fai
        -r, --region REGION  print the specified region
        -c, --stdin          read a stream of line-delimited region specifiers
                             on stdin and print the corresponding sequence for
                             each on stdout
        -e, --entropy        print the shannon entropy of the specified region

    REGION is of the form <seq>, <seq>:<start>..<end>,
    <seq1>:<start>..<seq2>:<end> where start and end are 1-based, and the
    region includes the end position. Specifying a sequence name alone will
    return the entire sequence, specifying range will return that range, and
    specifying a single coordinate pair, e.g. <seq>:<start> will return just
    that base.


Limitations:

fastahack will only generate indexes for FASTA files in which the sequences
have self-consistent line lengths. Trailing whitespace is allowed at the end
of sequences, but not embedded within the sequence. These limitations are
necessitated by the complexity of indexing sequences whose lines change in
length: the use of indexes is frustrated by such inconsistencies, since each
change in line length would require a new entry in the index file.
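
The line-length restriction follows from the arithmetic an index lookup relies
on. The Python sketch below is an illustration only, not fastahack's own code.
It assumes a samtools-style index entry that records, for each sequence, the
byte offset of its first base, the number of bases per line, and the number of
bytes per line including the newline. The names byte_pos and fetch and the toy
FASTA record are hypothetical.

```python
import io


def byte_pos(offset, line_bases, line_width, pos0):
    """Map a 0-based base position to its byte position in the file.

    offset:     byte offset of the sequence's first base
    line_bases: number of bases on each full line
    line_width: number of bytes on each full line, newline included
    """
    # Every full line skipped costs line_width bytes, not line_bases,
    # which is only predictable when all lines have the same length.
    return offset + (pos0 // line_bases) * line_width + pos0 % line_bases


def fetch(handle, offset, line_bases, line_width, start, length):
    """Return `length` bases beginning at 1-based position `start`."""
    first = byte_pos(offset, line_bases, line_width, start - 1)
    last = byte_pos(offset, line_bases, line_width, start - 1 + length - 1)
    handle.seek(first)                    # jump straight to the first base
    raw = handle.read(last - first + 1)   # bases plus any newlines in between
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Toy record: 14 bases stored 6 per line, 7 bytes per line with the newline.
fasta = b">8 test\nACGTAC\nGTTTGA\nCC\n"
offset = fasta.index(b"\n") + 1           # first byte after the header line

print(fetch(io.BytesIO(fasta), offset, 6, 7, 5, 4))  # ACGT
```

Because the byte position is computed from a single pair of line-length
values, one seek and one read are enough, and the rest of the file is never
loaded. If a line in the middle of the sequence were shorter or longer than
the others, every position after it would map to the wrong byte.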