indexing error: there is no k-mer that belongs to a list of size > 128 and <= 256
Closed this issue · 9 comments
Hi!
I was trying out the new version of simpleaf (0.16.1) and piscem (0.7.0) and I ran into an error during indexing.
Installing everything through conda/mamba.
mamba create -n simpleaf_reprex -c conda-forge -c bioconda -y simpleaf==0.16.1 piscem==0.7.0
mamba activate simpleaf_reprex
Downloading the mouse genome from ENSEMBL.
# Download and unzip the genome assembly.
curl -O 'https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_rm.primary_assembly.fa.gz'
gunzip 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa.gz'
# Download the genome annotation.
curl -O 'https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.chr.gtf.gz'
Running simpleaf index.
export ALEVIN_FRY_HOME='.'
simpleaf set-paths
simpleaf index \
--output index \
--overwrite \
--threads 12 \
--ref-type 'spliced+intronic' \
--fasta 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa' \
--gtf 'Mus_musculus.GRCm39.111.chr.gtf.gz' \
--rlen 100 \
--decoy-paths 'Mus_musculus.GRCm39.dna_rm.primary_assembly.fa'
The error.
[2024-01-30 15:17:45.643] [info] === step 3: 'build_index' 81.135411 [sec] (114.02275084064424 [ns/kmer])
[2024-01-30 15:17:45.643] [info] max_num_super_kmers_in_bucket 273
[2024-01-30 15:17:45.643] [info] log2_max_num_super_kmers_in_bucket 9
[2024-01-30 15:17:46.285] [info] num_buckets_in_skew_index 55 / 124009136 (4.435157100038178e-05%)
[2024-01-30 15:17:46.880] [info] num_partitions 3
[2024-01-30 15:17:46.880] [info] computing partitions...
[2024-01-30 15:17:46.880] [info] num_kmers belonging to buckets of size > 64 and <= 128: 15989
[2024-01-30 15:17:46.880] [critical] ==> Empty bucket detected:
there is no k-mer that belongs to a list of size > 128 and <= 256
fatal runtime error: Rust cannot catch foreign exceptions
====
Error: piscem index failed to build succesfully ExitStatus(unix_wait_status(134))
ohhh — thanks for reporting this @rpolicastro! This seems like an error in the underlying construction of sshash. I'll ping @jermp here as he will likely know where this is coming from and what the fix might be upstream (once we know the fix, it will be pretty easy to propagate new versions of piscem; simpleaf itself shouldn't require an update). I believe it's related to the skew data structure in sshash, and particularly to one of the skew buckets being empty! One really simple idea is, what if you try with a smaller minimizer size — like -m 17
or 15
?
Hi all,
yes, as Rob suggested, the error is due to SSHash detecting an empty bucket in the skew index.
You don't need to understand this detail; just change the minimizer length or try a different seed for the construction.
@rob-p, does piscem accept a seed argument and pass it to SSHash?
Best,
-Giulio
No, but it would be trivial to expose (the underlying piscem-cpp includes it). So I can add that and in the meantime @rpolicastro can try a different minimizer size. However, I wonder if there is a way to make the skew index itself robust to possibly empty buckets.
--Rob
Yeah, good point. I know it is annoying that this might happen.
In principle, it is possibile to avoid -- of course -- but needs a "search" that is slower. I'll think about it.
Perhaps, it is of no harm since there are very few bucket thresholds.
Well, just to keep my on track: jermp/sshash#42.
:)
Thank you both for the quick response. -m 17
did indeed fix the issue. I'm glad it was easy to diagnose!
I assume if I move forward with this setting for now it won't have any adverse effects on the downstream analysis?
Cheers,
Bob
Nope! In fact, the results should not be any different. The only distinction is that mapping may become a few % slower (may not even be noticeable). I am adding the --seed
flag to piscem
upstream, and then I will propagate it to simpleaf
as well; just a flag you can pass to change the seed for SSHash construction if you don't want to change the minimizer length.
--Rob
I didn't check with the above reprex but I just wanted to confirm with the new simpleaf version (0.16.2) that setting the minimizer length back to default -m 19
and setting some arbitrary seed --seed 100
fixed the error in my original workflow :)
w00t! I was just going to come here to ask if you could do this if it wasn't too much trouble, but you already did it! Glad to hear that just propagating a different seed fixes it :).