bitnuc

A library for efficient nucleotide sequence manipulation using 2-bit encoding.

Features

  • 2-bit nucleotide encoding (A=00, C=01, G=10, T=11)
  • Direct bit manipulation functions for custom implementations
  • Higher-level sequence type with additional analysis features
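To make the encoding concrete, here is a minimal standalone sketch of 2-bit packing that does not depend on the crate. The helpers `encode_base`, `pack`, and `unpack` are hypothetical names used for illustration only, and are not bitnuc's implementation. The bit order (first base in the lowest two bits) is an assumption chosen to match the `0b11100100` value for `ACGT` shown in the packing example in this README.

```rust
/// Map one nucleotide byte to its 2-bit code (A=00, C=01, G=10, T=11).
/// Hypothetical helper for illustration; not part of bitnuc's API.
fn encode_base(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Pack up to 32 bases into a u64, placing the first base in the lowest two bits.
/// Returns None for sequences longer than 32 bases or containing other bytes.
fn pack(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        packed |= encode_base(base)? << (2 * i);
    }
    Some(packed)
}

/// Unpack `len` bases (at most 32) from a packed u64.
fn unpack(packed: u64, len: usize) -> Vec<u8> {
    assert!(len <= 32, "a u64 holds at most 32 bases");
    (0..len)
        .map(|i| b"ACGT"[((packed >> (2 * i)) & 0b11) as usize])
        .collect()
}

fn main() {
    let packed = pack(b"ACGT").expect("only A, C, G, T expected");
    println!("{packed:#010b}");
    println!("{}", String::from_utf8_lossy(&unpack(packed, 4)));
}
```

Because each base occupies a fixed two-bit slot, a base at position `i` can be read with one shift and one mask, which is what makes direct bit manipulation practical.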

Low-Level Packing Functions

For direct bit manipulation, use the as_2bit and from_2bit functions:

use bitnuc::{as_2bit, from_2bit};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pack a sequence into a u64
    let packed = as_2bit(b"ACGT")?;
    assert_eq!(packed, 0b11100100);

    // Unpack back to a sequence
    let mut unpacked = Vec::new(); // Allocate a reusable buffer
    from_2bit(packed, 4, &mut unpacked)?;
    assert_eq!(&unpacked, b"ACGT");
    unpacked.clear(); // Reuse the buffer
    Ok(())
}

These functions are useful when you need to:

  • Implement custom sequence storage
  • Manipulate sequences at the bit level
  • Integrate with other bioinformatics tools
  • Copy sequences more efficiently
  • Hash sequences more efficiently

For example, packing multiple short sequences:

use bitnuc::{as_2bit, from_2bit};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pack multiple 4-mers into u64s
    let kmers = [b"ACGT", b"TGCA", b"GGCC"];
    let packed: Vec<u64> = kmers
        .into_iter()
        .map(|kmer| as_2bit(kmer))
        .collect::<Result<_, _>>()?;

    // Unpack when needed
    let mut kmers = Vec::new();
    from_2bit(packed[0], 4, &mut kmers)?;
    assert_eq!(&kmers, b"ACGT");
    Ok(())
}

High-Level Sequence Type

For more complex sequence manipulation, use the [PackedSequence] type:

use bitnuc::{PackedSequence, GCContent, BaseCount};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let seq = PackedSequence::new(b"ACGTACGT")?;

    // Sequence analysis
    println!("GC Content: {}%", seq.gc_content());
    let [a_count, c_count, g_count, t_count] = seq.base_counts();

    // Slicing
    let subseq = seq.slice(1..5)?;
    assert_eq!(&subseq, b"CGTA");
    Ok(())
}

Memory Usage

The 2-bit encoding provides significant memory savings:

Standard encoding: 1 byte per base
ACGT = 4 bytes = 32 bits

2-bit encoding: 2 bits per base
ACGT = 8 bits

This means roughly four times as many bases fit in the same amount of memory.
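The arithmetic can be checked with a small sketch. The function names below are hypothetical and only illustrate the byte counts. They assume the packed form is rounded up to whole bytes, so the saving is exactly 4x only when the length is a multiple of four. Storing into whole `u64` words would add further padding.

```rust
/// Bytes needed at one byte per base (plain ASCII storage).
fn ascii_bytes(n_bases: usize) -> usize {
    n_bases
}

/// Bytes needed at 2 bits per base, rounded up to whole bytes.
fn packed_bytes(n_bases: usize) -> usize {
    (n_bases + 3) / 4
}

fn main() {
    for n in [4usize, 32, 150, 1_000_000] {
        println!(
            "{n} bases: {} bytes as ASCII, {} bytes packed",
            ascii_bytes(n),
            packed_bytes(n)
        );
    }
}
```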

Error Handling

All operations that could fail return a [Result] with [NucleotideError]:

use bitnuc::{as_2bit, NucleotideError};

fn main() {
    // Invalid nucleotide: 'N' has no 2-bit code
    let err = as_2bit(b"ACGN").unwrap_err();
    assert!(matches!(err, NucleotideError::InvalidBase(b'N')));

    // Sequence too long: a u64 holds at most 32 bases
    let long_seq = vec![b'A'; 33];
    let err = as_2bit(&long_seq).unwrap_err();
    assert!(matches!(err, NucleotideError::SequenceTooLong(33)));
}
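The 32-base limit follows from a `u64` holding 32 two-bit codes. One possible way to handle longer input with the low-level approach is to split it into 32-base chunks and pack each chunk into its own word. The sketch below is standalone: `PackError`, `pack_chunk`, and `pack_long` are hypothetical names, with `pack_chunk` standing in for a call such as `as_2bit` so that the example runs without the crate. It is not how bitnuc stores long sequences internally.

```rust
/// Hypothetical error type for this sketch; not bitnuc's NucleotideError.
#[derive(Debug, PartialEq)]
enum PackError {
    InvalidBase(u8),
}

/// Stand-in for a single-word packer: packs at most 32 bases into a u64,
/// with the first base in the lowest two bits.
fn pack_chunk(chunk: &[u8]) -> Result<u64, PackError> {
    debug_assert!(chunk.len() <= 32);
    let mut packed = 0u64;
    for (i, &base) in chunk.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            other => return Err(PackError::InvalidBase(other)),
        };
        packed |= code << (2 * i);
    }
    Ok(packed)
}

/// Pack a sequence of any length into one u64 per 32-base chunk.
/// The first invalid base aborts packing and is returned as the error.
fn pack_long(seq: &[u8]) -> Result<Vec<u64>, PackError> {
    seq.chunks(32).map(pack_chunk).collect()
}

fn main() {
    let long_seq = vec![b'A'; 33];
    let words = pack_long(&long_seq).expect("only A, C, G, T expected");
    println!("{} bases packed into {} words", long_seq.len(), words.len());
}
```

The original length has to be kept alongside the words, since a partially filled last word cannot be told apart from one padded with trailing `A` bases.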

Performance Considerations

When working with many short sequences (like k-mers), using as_2bit and from_2bit directly can be more efficient than creating [PackedSequence] instances:

use bitnuc::{as_2bit, from_2bit};
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Efficient k-mer counting
    let mut kmer_counts = HashMap::new();

    // Pack k-mers directly into u64s
    let sequence = b"ACGTACGT";
    for window in sequence.windows(4) {
        let packed = as_2bit(window)?;
        *kmer_counts.entry(packed).or_insert(0) += 1;
    }

    // Count of "ACGT"
    let acgt_packed = as_2bit(b"ACGT")?;
    assert_eq!(kmer_counts.get(&acgt_packed), Some(&2));
    Ok(())
}

See the documentation for [as_2bit] and [from_2bit] for more details on working with packed sequences directly.
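The counting loop above repacks every window from scratch. Once k-mers are held as integers, a common alternative is a rolling update that shifts out the oldest base and shifts in the newest, which costs constant time per position. The following is a standalone sketch of that technique, not a bitnuc API: `code` and `count_kmers` are hypothetical names, and the layout (first base in the lowest two bits) is assumed to match the packed values shown earlier.

```rust
use std::collections::HashMap;

/// 2-bit code for one base (A=00, C=01, G=10, T=11); hypothetical helper.
fn code(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Count all k-mers (1 <= k <= 32) of `seq`, keyed by their packed value.
/// Returns None if the sequence contains a byte other than A, C, G, or T.
fn count_kmers(seq: &[u8], k: usize) -> Option<HashMap<u64, usize>> {
    assert!((1..=32).contains(&k), "k must be between 1 and 32");
    let mut counts: HashMap<u64, usize> = HashMap::new();
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        // Drop the oldest base (lowest two bits) and place the new base
        // in the highest slot of the k-mer.
        packed = (packed >> 2) | (code(base)? << (2 * (k - 1)));
        // The window is full once k bases have been consumed.
        if i + 1 >= k {
            *counts.entry(packed).or_insert(0) += 1;
        }
    }
    Some(counts)
}

fn main() {
    let counts = count_kmers(b"ACGTACGT", 4).expect("only A, C, G, T expected");
    // 0b11100100 is the packed form of "ACGT" under this layout.
    println!("ACGT occurs {} times", counts[&0b11100100]);
}
```

Whether this beats repacking each window depends on k and on how fast the packing routine is, so it is worth benchmarking before adopting it.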

SIMD Acceleration

as_2bit is SIMD accelerated where the target architecture supports it. SIMD instructions are used by default and can be disabled with the nosimd feature flag.

To take full advantage of SIMD, I recommend compiling with:

RUSTFLAGS="-C target-cpu=native"

or adding these flags to your project via the cargo build config:

# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]

On my machine, this increases throughput by 10% to 30%, depending on sequence size.