Parse, index and get random access to FASTA files with no extra dependencies.
tinyFA provides a (relatively) fast and highly minimalist header-only library for reading FASTA files, especially in the case that you'd like random access via their FAI indices. It requires only a modern (GCC4.8 or newer) C++ compiler.
Make sure both headers are in a path accessible to your code. You can either
copy them to your include directory, or pass the directory to your compiler
with -I/path/to/tinyFA
.
Then just include both headers in your C++ code and use the TFA namespace:
#include "tinyFA/tinyfa.hpp"
#include "tinyFA/pliib.hpp"
using namespace TFA;
#include "tinyfa.hpp"
#include "pliib.hpp"
int main (int argc, char** argv){
// usage: ./getseq <fastaFile> <seqName> <start> <end>
// Parse a FASTA file and extract a subsequence.
// If a FASTA index exists, use it, otherwise,
// build one.
// The tinyFA faidx struct
tiny_faidx_t tf;
// Check if an index exists, and create one if not.
if (!checkFAIndexFileExists(argv[1])){
createFAIndex(argv[1], tf);
}
else{
// Parses an FAI file when passed a FASTA file name.
parseFAIndex(argv[1], tf);
}
char* contigName = argv[2];
int start = atoi(argv[3]);
int end = atoi(argv[4]);
char* seq;
// Not passing start/end will return the whole contig.
getSequence(tf, contigName, seq, start, end);
cout << seq << endl;
// seq gets allocated in getSequence, so you'll want to delete that.
delete [] seq;
return 0;
}
tinyFA takes a lot of code and inspiration from the excellent fastahack. Fastahack has been extensively used and we recommend it for production environments. It differs from fastahack in that:
- There's no default library setup for Fastahack. This could easily be remedied with some small makefile tweaks.
- tinyFA mostly uses structs and primitive types (rather than STL containers).
There's also htslib, but if you want to parse a FASTA file you have to build utilities for SAM/BAM/CRAM/VCF, it requires zlib, lzma, and lz4 (which are often not up to date / installed by default).
However, htslib can parse files compressed with BGZF; this may be an important addition when dealing with large files.
SeqLib and SeqAn are both excellent tools that add to htslib, with the same cons and many extra pros.