Minimizer settings when using haplotype sampling
Closed this issue · 6 comments
Hello,
I have a question related to mapping very short reads (sub 40bp) while using the haplotype sampling option.
In issue #3998 it was raised that mapping extremely short reads may require constructing a minimizer index with reduced k, using vg minimizer.
Is it currently supported to do this as part of the haplotype subsampling option within vg giraffe? Since specifying a .hapl file builds the subsampled graph, as well as the min & dist indexes - it seems the option is buried to an extent.
Is there a way for users to get some control over the k values in the giraffe pipeline, for example by running the individual steps in building those sub-graph files?
thanks and best regards,
Cormac
Changing minimizer parameters is possible, but you need to do it manually. See the wiki for some documentation.
There are three commands that you need to change:
- Creating the
.hapl
file withvg haplotypes
. By default, the haplotypes are descibed by the presence/absence of (k = 29, w = 11) minimizers with a single hit in the graph. You can change the minimizer parameters with options--kmer-length
and--minimizer-length
, but only minimizers that are unique in the graph will be used. - Counting the kmers in the reads. Here you must use the same kmer length as when creating the
.hapl
file. - Building the indexes manually. Section "Index construction and kmer counting" gives the basic commands, and you can add options
-k
and-w
to thevg minimizer
command to change the minimizer parameters. These parameters can be different than the ones used in the previous steps.
Thanks that was really helpful!
Hi Jouni,
A couple of follow up questions:
- I think I misunderstood that you can use
--minimizer-length
withinvg haplotypes
, but I see that it is an unrecognised option. - I think I understand that I should instead set this one later in
vg minimizer
.
My current steps were:
vg haplotypes --kmer-length 20 --haplotype-output graph.hapl graph.gbz
- Use kmc on the reads with k = 20
- Use
vg haplotypes
to subsample the clipped gbz (I think this step is missing from the current docs but Glenn Hickey tipped me off here) - Run
vg index
andvg minimizer
on the subsampled graph - in the latter step I see I have options to test -k and -w.
And finally:
- For the actual giraffe mapping, I assume I no longer specify
--haplotype-name
, but instead treat it like a filtered graph, i.e. specify the subsampled.gbz
,.dist
, and.min
files.
Does that overall sound right to you?
Thanks again,
Cormac
Running Giraffe on indexes generated as above I get the following:
warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.
error:[vg::get_sequence_dictionary] No reference or non-alt-allele generic paths available in the graph!
Fixed this by adding: --include-reference
to the vg haplotypes
subsampling call - seems to be mapping now!
The --minimizer-length
option should have been --window-length
.