COMBINE-lab/simpleaf

indexing error: there is no k-mer that belongs to a list of size > 128 and <= 256

Closed this issue · 9 comments

Hi!

I was trying out the new version of simpleaf (0.16.1) and piscem (0.7.0) and I ran into an error during indexing.

Installing everything through conda/mamba.

mamba create -n simpleaf_reprex -c conda-forge -c bioconda -y simpleaf==0.16.1 piscem==0.7.0
mamba activate simpleaf_reprex

Downloading the mouse genome from ENSEMBL.

# Download and unzip the genome assembly.
curl -O 'https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_rm.primary_assembly.fa.gz'
gunzip 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa.gz'

# Download the genome annotation.
curl -O 'https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.chr.gtf.gz'

Running simpleaf index.

export ALEVIN_FRY_HOME='.'
simpleaf set-paths

simpleaf index \
    --output index \
    --overwrite \
    --threads 12 \
    --ref-type 'spliced+intronic' \
    --fasta 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa' \
    --gtf 'Mus_musculus.GRCm39.111.chr.gtf.gz' \
    --rlen 100 \
    --decoy-paths 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa'

The error.

[2024-01-30 15:17:45.643] [info] === step 3: 'build_index' 81.135411 [sec] (114.02275084064424 [ns/kmer])
[2024-01-30 15:17:45.643] [info] max_num_super_kmers_in_bucket 273
[2024-01-30 15:17:45.643] [info] log2_max_num_super_kmers_in_bucket 9
[2024-01-30 15:17:46.285] [info] num_buckets_in_skew_index 55 / 124009136 (4.435157100038178e-05%)
[2024-01-30 15:17:46.880] [info] num_partitions 3
[2024-01-30 15:17:46.880] [info] computing partitions...
[2024-01-30 15:17:46.880] [info] num_kmers belonging to buckets of size > 64 and <= 128: 15989
[2024-01-30 15:17:46.880] [critical] ==> Empty bucket detected:
there is no k-mer that belongs to a list of size > 128 and <= 256
fatal runtime error: Rust cannot catch foreign exceptions
====
Error: piscem index failed to build succesfully ExitStatus(unix_wait_status(134))
rob-p commented

ohhh — thanks for reporting this @rpolicastro! This seems like an error in the underlying construction of sshash. I'll ping @jermp here as he will likely know where this is coming from and what the fix might be upstream (once we know the fix, it will be pretty easy to propagate new versions of piscem; simpleaf itself shouldn't require an update). I believe it's related to the skew data structure in sshash, and particularly to one of the skew buckets being empty! One really simple idea is, what if you try with a smaller minimizer size — like -m 17 or 15?

jermp commented

Hi all,
yes, as Rob suggested, the error is due to SSHash detecting an empty bucket in the skew index.
You don't need to understand this detail; just change the minimizer length or try a different seed for the construction.
@rob-p, does piscem accept a seed argument and pass it to SSHash?

Best,
-Giulio

rob-p commented

@jermp,

No, but it would be trivial to expose (the underlying piscem-cpp includes it). So I can add that and in the meantime @rpolicastro can try a different minimizer size. However, I wonder if there is a way to make the skew index itself robust to possibly empty buckets.

--Rob

jermp commented

Yeah, good point. I know it is annoying that this might happen.
In principle, it is possibile to avoid -- of course -- but needs a "search" that is slower. I'll think about it.
Perhaps, it is of no harm since there are very few bucket thresholds.

jermp commented

Well, just to keep my on track: jermp/sshash#42.
:)

Thank you both for the quick response. -m 17 did indeed fix the issue. I'm glad it was easy to diagnose!

I assume if I move forward with this setting for now it won't have any adverse effects on the downstream analysis?

Cheers,
Bob

rob-p commented

@rpolicastro,

Nope! In fact, the results should not be any different. The only distinction is that mapping may become a few % slower (may not even be noticeable). I am adding the --seed flag to piscem upstream, and then I will propagate it to simpleaf as well; just a flag you can pass to change the seed for SSHash construction if you don't want to change the minimizer length.

--Rob

I didn't check with the above reprex but I just wanted to confirm with the new simpleaf version (0.16.2) that setting the minimizer length back to default -m 19 and setting some arbitrary seed --seed 100 fixed the error in my original workflow :)

rob-p commented

w00t! I was just going to come here to ask if you could do this if it wasn't too much trouble, but you already did it! Glad to hear that just propagating a different seed fixes it :).