biomodern::FMIndex
is specialization of FM-Index for genome sequence which implemented using C++20. The core data structure of FMIndex
are following:
biomodern::DibitVector bwt; // a compression vector which store the bwt.
vector<array<uint32_t, 4>> occ1;
vector<array<uint8_t, 4>> occ2; // a hierarchical sampled occurrence table.
vector<uint32_t> sa; // a sampled suffix array.
vector<uint32_t> lookup; // a lookup table for fixed suffix query.
The sampling parameters are those three:
occ_intv
: occ sampling interval, default value is16
.sa_intv
: sa sampling interval, default value is1
.lookup_len
: the length of fixed suffix for lookup, default value is14
.
The memory usage for FMIndex
in run time is affect by above parameters. Take a 3.1Gb
human genome for example, the memory occupation can be calculate as follwing:
- bwt string:
3.1Gb / 4 = 0.775 Gb
. - hierarchical occurrence table: L1 occ occupy fixed
3.1Gb * 16 / 256 = 0.194Gb
plus L2 occ occupy3.1Gb * 4 / occ_intv(16) = 0.775 Gb
. - suffix array:
3.1Gb * 4 / sa_intv(1) = 12Gb
. Noticed that in default mode we dont sampling suffix value since this can reduce frequently memory allocation and intense computation when occurs massive query. - lookup table:
4^lookup_len(14) * 4 / 1024^3 = 1Gb
.
So the default total memory occupation for 3.1Gb
human genome is 0.775 Gb + 0.194Gb + 0.775Gb + 12Gb + 1Gb = 14.744Gb
.
- GCC 10.2
The FMIndex class provide very simple interface for ultra fast exact match:
using istring_view = basic_string_view<int8_t>;
// Utility for compute the suffix array from istring
static vector<uint32_t> FMIndex::get_sa(istring_view ref);
// Ctor. initialize the core data structure from reference istring.
// Noticed that this Ctor will call FMIndex::get_sa(...) to generate suffix array.
FMIndex(istring_view ref);
// Get begin and end index of the uncompressed suffix array for the input seed.
// Difference between begin and end is the occurrence count in reference.
// The third value is the remaining prefix length for the seed.
// Notice that when the stop_cnt is -1, this value always be 0.
// The stop_cnt can be using to early stop when occurrence count is not greater than the value.
// Set to -1 to forbid early stop.
tuple<uint32_t, uint32_t, uint32_t> get_range(istring_view seed, uint32_t stop_cnt);
// Compute the suffix array value according to the begin and end.
// If sa_intv is 1, this can be done at O(1).
vector<uint32_t> get_offsets(uint32_t, uint32_t);
// Utility for serialization.
// The binary binary archive file size is same as the memory occupation in run time.
void save(ofstream&);
void load(ifstream&);
#include <iostream>
#include <experimental/random>
#include "fm_index.hpp"
int main() {
using namespace biomodern::utility;
using namespace std::string_literals;
{
auto ref = istring{};
ref.reserve(3137454505);
auto fin = std::ifstream{"hs37d5.fa"};
for (auto line = ""s; std::getline(fin, line);) {
if (line.front() == '>') {
std::cout << line << "\n";
continue;
}
auto subref = Codec::to_istring(line);
// fm-index only support four characters so we need change 'N' to 'ACGT'
for (auto& c : subref) if (c == 4) c = std::experimental::randint(0, 3);
ref += subref;
}
const auto fmi = biomodern::FMIndex{ref};
auto fout = std::ofstream{"hs37d5.fmi", std::ios::binary};
fmi.save(fout);
}
auto fmi = biomodern::FMIndex{};
{
auto fin = std::ifstream{"hs37d5.fmi", std::ios::binary};
fmi.load(fin);
}
for (auto seed = istring{}; std::cin >> seed;) {
const auto [beg, end, offset] = fmi.get_range(seed, 0);
std::cout << "seed: " << seed << "\n";
std::cout << "seed offset: " << offset << "\n";
std::cout << "occurrence: " << end - beg << "\n";
const auto offsets = fmi.get_offsets(beg, end);
std::cout << "ref offsets: ";
for (const auto offset : offsets) std::cout << offset << " ";
std::cout << "\n";
}
}