vgteam/vg

Minimizer settings when using haplotype sampling

Closed this issue · 6 comments

Hello,

I have a question related to mapping very short reads (sub 40bp) while using the haplotype sampling option.

In issue #3998 it was raised that mapping extremely short reads may require constructing a minimizer index with reduced k, using vg minimizer.

Is it currently supported to do this as part of the haplotype subsampling option within vg giraffe? Since specifying a .hapl file builds the subsampled graph, as well as the min & dist indexes - it seems the option is buried to an extent.

Is there a way for users to get some control over the k values in the giraffe pipeline, for example by running the individual steps in building those sub-graph files?

thanks and best regards,
Cormac

Changing minimizer parameters is possible, but you need to do it manually. See the wiki for some documentation.

There are three commands that you need to change:

  1. Creating the .hapl file with vg haplotypes. By default, the haplotypes are descibed by the presence/absence of (k = 29, w = 11) minimizers with a single hit in the graph. You can change the minimizer parameters with options --kmer-length and --minimizer-length, but only minimizers that are unique in the graph will be used.
  2. Counting the kmers in the reads. Here you must use the same kmer length as when creating the .hapl file.
  3. Building the indexes manually. Section "Index construction and kmer counting" gives the basic commands, and you can add options -k and -w to the vg minimizer command to change the minimizer parameters. These parameters can be different than the ones used in the previous steps.

Thanks that was really helpful!

Hi Jouni,

A couple of follow up questions:

  • I think I misunderstood that you can use --minimizer-length within vg haplotypes, but I see that it is an unrecognised option.
  • I think I understand that I should instead set this one later in vg minimizer.

My current steps were:

  • vg haplotypes --kmer-length 20 --haplotype-output graph.hapl graph.gbz
  • Use kmc on the reads with k = 20
  • Use vg haplotypes to subsample the clipped gbz (I think this step is missing from the current docs but Glenn Hickey tipped me off here)
  • Run vg index and vg minimizer on the subsampled graph - in the latter step I see I have options to test -k and -w.

And finally:

  • For the actual giraffe mapping, I assume I no longer specify --haplotype-name, but instead treat it like a filtered graph, i.e. specify the subsampled .gbz, .dist, and .min files.

Does that overall sound right to you?
Thanks again,

Cormac

Running Giraffe on indexes generated as above I get the following:

warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.
error:[vg::get_sequence_dictionary] No reference or non-alt-allele generic paths available in the graph!

Fixed this by adding: --include-reference to the vg haplotypes subsampling call - seems to be mapping now!

The --minimizer-length option should have been --window-length.