kage-genotyper/kage

Error when running kage index

unavailable-2374 opened this issue · 21 comments

Hi,

I encountered an error while running kage index; it seems to occur during parsing of the VCF file. My input files were a single chromosome from the genome and a VCF file from PGGB, which had been normalized with BCFtools.

Here is my log file.
log.txt

Best regards!

Hi!

Thanks for sharing! I'm a bit unsure what causes this problem. Does your VCF contain genotypes? Would you be able to share a small subset of the VCF, e.g. the header and the first 10 variants or so?

Hi,

Sorry for the delay, here is a subset of the VCF.

Hope it's useful!
demo.vcf.txt

Sincerely,
Shuo

Thanks for providing the vcf!

I think the genotype columns in your vcf are invalid:

PN1	5147	>4637497>4637500	A	T	60	.	AC=2;AF=0.1;AN=20;AT=>4637497>4637499>4637500,>4637497>4637498>4637500;NS=20;LV=0	GT	0	.	0	0	.	.	1	0	0	0	0	0	0	0	0	1	.	.	.	.	.	.	.	.	.	.	.	.	0	.	.	.	.	.	0	0	0	0	0	0	.	.

Valid genotypes would be e.g. 0|1 or 0/1, not just 0 or 1. Could something have gone wrong in your preprocessing of the VCF file?

Hello,

This VCF was produced by PGGB, which uses assemblies to construct the graph genome, so there is only one genotype value per sample. Or can we think of it as a purebred?

Best,
Shuo

Ah, I see. So do I understand you correctly that the individuals are actually diploid, but they are purebred so that the genotype is always homozygous, so that 0 is in principle 0/0?

If that is the case, this should work with KAGE I think, but I will need to push a minor fix for that :)

I've added a small tool in kage now that lets you convert the vcf so that genotypes are diploid before running KAGE. I think that might be an okay solution for now. You do this by running kage convert_purebread_vcf before running kage index, like this:

kage convert_purebread_vcf -v demo.vcf -o converted.vcf
kage index -v converted.vcf .....

The first command simply replaces haploid genotypes with diploid so that they are compatible with kage.
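For illustration, here is a minimal sketch of what such a conversion amounts to (this is hypothetical helper code, not kage's actual implementation), assuming each sample column holds only a bare GT value as in the PGGB VCF:

```python
def haploid_to_diploid_line(line: str) -> str:
    """Rewrite haploid GT values (0, 1, .) on one VCF line as
    homozygous diploid values (0|0, 1|1, .|.)."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF columns (CHROM..FORMAT); the rest are samples.
    fixed, samples = fields[:9], fields[9:]
    samples = [gt if "/" in gt or "|" in gt else f"{gt}|{gt}" for gt in samples]
    return "\t".join(fixed + samples)
```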

You will need to update to version 2.0.1 first with pip install kage-genotyper==2.0.1.

Note that KAGE will then genotype this as diploid, which should be fine, except that it may output heterozygous genotypes when it thinks they are likely. This can easily be fixed in a postprocessing step by converting these to homozygous, picking the most likely homozygous genotype instead; I can provide a solution for this if you need it.
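A sketch of that postprocessing idea (hypothetical helper, not an existing kage function), assuming a biallelic site with Phred-scaled genotype likelihoods in the standard VCF PL order 0/0,0/1,1/1, where lower means more likely:

```python
def force_homozygous(gt: str, pl: str) -> str:
    """Replace a heterozygous call with the more likely homozygote,
    based on the PL field (order 0/0,0/1,1/1 for a biallelic site)."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or len(set(alleles)) == 1:
        return gt  # missing or already homozygous: keep as-is
    hom_ref, _, hom_alt = (int(x) for x in pl.split(","))
    return "0/0" if hom_ref <= hom_alt else "1/1"
```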

Let me know if I've misunderstood something, or if the solution does not work :)

Thank you for getting back to me so quickly!

There is one issue that may require further clarification. The variant information for each sample is obtained from the assembly. However, since each individual is split into hap1 and hap2 for assembly purposes, each assembly can be treated as a haploid. I am unsure whether KAGE can handle this situation by treating each haploid as a homozygous diploid.

Best~

I see! I think I maybe understand the VCF now. So I assume, for instance, that BG_hap1_chr01 and BG_hap2_chr01 are the two haplotypes from the same individual? If that is the case, the best thing to do would be to group the haplotypes two and two into single diploid individuals in the VCF before running KAGE.

If you can confirm that I've understood this correctly, I can fix a preprocessing script that handles this.
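To make the grouping concrete, a minimal sketch of merging pairs of haploid sample columns into diploid genotypes (hypothetical helper code, assuming GT-only sample columns as in this VCF):

```python
def merge_haplotype_columns(line: str, pairs: list[tuple[int, int]]) -> str:
    """Merge pairs of haploid sample columns into diploid genotypes,
    e.g. hap1 '0' and hap2 '1' -> '0|1'. The pairs are 0-based indices
    among the sample columns (column 9 onwards)."""
    fields = line.rstrip("\n").split("\t")
    fixed, samples = fields[:9], fields[9:]
    merged = [f"{samples[i]}|{samples[j]}" for i, j in pairs]
    return "\t".join(fixed + merged)
```

The sample names in the header line would of course have to be merged the same way.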

However, I see there are also some sample names where it is less clear whether they are from the same individual or separate ones; some examples are:

Vcs201  Ved01   Ves01   Vme01   Vmu101  Vmu201  Vri01   Vsc01   Vsp01

Are these just single haploid assemblies?

Yes! That's it! But unfortunately, not all individuals were haplotype-resolved, so some samples, such as Ved and Ves, are just primary assemblies.

May I kindly suggest that compatibility with the VCFs generated by PGGB or Minigraph-Cactus could be important. We usually use the set of variants they produce for downstream analyses 🤟.

Hi,

I replaced 1 in the VCF with 1/1, 0 with 0/0, and '.' with './.', treating all loci as homozygous, but it still resulted in this error:

```
Estimating global kmer counts: 12%|█████████▋ | 1/8 [00:23<02:46, 23.85s/haplotype]
Traceback (most recent call last):
File "/public/tools/python/bin/kage", line 8, in <module>
sys.exit(main())
File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 51, in main
run_argument_parser(sys.argv[1:])
File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 546, in run_argument_parser
args.func(args)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 297, in make_index_cli
r = make_index(args.reference, args.vcf, args.out_base_name,
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 156, in make_index
scorer = make_kmer_scorer_from_random_haplotypes(graph, haplotype_matrix, k, n_haplotypes=8, modulo=modulo * 40 + 11)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/kmer_scoring.py", line 140, in make_kmer_scorer_from_random_haplotypes
for subkmers in kmers:
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/graph.py", line 228, in get_haplotype_kmers
sequence = self.sequence(haplotype, reverse_complement=reverse_complement).ravel()
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/graph.py", line 129, in sequence
result = zip_sequences(ref_sequence, variant_sequences, encoding)
File "/public/tools/python/lib/python3.10/site-packages/kage/util.py", line 69, in zip_sequences
new[1:-1:2] = b
File "/public/tools/python/lib/python3.10/site-packages/npstructures/raggedarray/indexablearray.py", line 102, in __setitem__
self._set_data_range(index, value.ravel())
File "/public/tools/python/lib/python3.10/site-packages/bionumpy/encoded_array.py", line 175, in _set_data_range
super()._set_data_range(idx, as_encoded_array(data, self._encoding).raw())
File "/public/agis/zhouyongfeng_group/caoshuo/tools/python/lib/python3.10/site-packages/bionumpy/encoded_array.py", line 597, in as_encoded_array
return target_encoding.encode(s)
File "/public/tools/python/lib/python3.10/site-packages/bionumpy/encoded_array.py", line 60, in encode
out = EncodedArray(self._encode(data), self)
File "/public/tools/python/lib/python3.10/site-packages/bionumpy/encodings/alphabet_encoding.py", line 33, in _encode
raise EncodingError(f"Error when encoding {''.join(chr(c) for c in byte_array.ravel()[0:100])} "
bionumpy.encodings.exceptions.EncodingError: ("Error when encoding CAAAAACCCGAACCTATCACTGCCCCAGTCAACTCACCAATCCTCAAAACCCCATCTGCAGAGTATTCAGAAGAAATTAATTAAGAACAGTGATCTTCTA to AlphabetEncoding. Invalid character(s): ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'][78, 78, 78, 78, 78, 78, 78, 78, 78, 78]", 174333)
```
Best~

It seems the problem now is that some of your variants contain Ns, which is a problem because KAGE is not able to compute kmers for them. Unfortunately, the error message was not very descriptive.

I've pushed an update now that can ignore the Ns so that KAGE doesn't crash when there are Ns in variant alleles, but ideally one should filter away variants with Ns in their alleles beforehand, I think.

But if you try the latest version (pip install kage-genotyper==2.0.2) hopefully it should not crash now.

I'm still thinking about the best way to deal with individuals that are not haplotype-resolved. I think it maybe makes the most sense to group the haplotype-resolved individuals into diploid genotypes and ignore the rest. But if you get kage index to run for now, we can think more about it after that :)

Thank you for the timely update. Regarding the variant information, does ignoring Ns mean disregarding the whole variant? That may not be appropriate in the current situation, as ignoring the whole variant may result in the loss of important information. During assembly, except for the latest T2T genomes, it is necessary to connect multiple contigs with Ns when going from contig to scaffold level.

What if only the N characters in a variant's alleles were ignored, rather than the entire variant? Do you think that would make more sense?

The fix is not ignoring variants with Ns completely; instead it simply encodes Ns as A, as a "hack" to be able to compute kmers. In my experience this is, although a bit dirty, usually fine, since few variants have alleles with Ns in them. The alternative is to skip these variants, i.e. filter them away before running kage. I guess you have quite few variants whose alleles contain Ns, since Ns are mostly added during scaffolding; those Ns end up on the reference between variants and will not affect anything. So I don't think these Ns should be much of a problem in practice.
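For the filtering alternative, a minimal sketch (hypothetical helper code, not part of kage) that drops data lines whose REF or ALT alleles contain an N; something similar could probably also be done with bcftools filtering expressions:

```python
def filter_n_variants(vcf_lines):
    """Yield header lines and only those variants whose REF/ALT
    alleles contain no N characters."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines
            continue
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]  # REF and ALT columns
        if "N" not in ref.upper() and "N" not in alt.upper():
            yield line
```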

Thank you so much! I will test both options.

Hi,

I split the VCF file by chromosome. All chromosomes were successfully indexed except for chromosome 14, which reported the following error message:

```
(find_signatures_for_chunk_wrapper pid=82840) INFO:root:0/3 signatures removed because they had score above 254.
Traceback (most recent call last):
File "/public/tools/python/bin/kage", line 8, in <module>
sys.exit(main())
File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 52, in main
run_argument_parser(sys.argv[1:])
File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 552, in run_argument_parser
args.func(args)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 298, in make_index_cli
r = make_index(args.reference, args.vcf, args.out_base_name,
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 186, in make_index
signatures = get_signatures(k, paths, scorer, chunk_size=signatures_chunk_size, spacing=0,
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 1068, in get_signatures
all_signatures = ray.get(all_signatures)
File "/public/tools/python/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/public/tools/python/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/public/tools/python/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::find_signatures_for_chunk_wrapper() (pid=82849, ip=192.168.10.248)
File "/public/tools/python/lib/python3.10/site-packages/awkward/operations/ak_sum.py", line 210, in sum
return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
File "/public/tools/python/lib/python3.10/site-packages/awkward/operations/ak_sum.py", line 277, in _impl
out = ak._do.reduce(
File "/public/tools/python/lib/python3.10/site-packages/awkward/_do.py", line 333, in reduce
next = layout._reduce_next(
File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1618, in _reduce_next
outcontent = trimmed._reduce_next(
File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1618, in _reduce_next
outcontent = trimmed._reduce_next(
File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1502, in _reduce_next
) = self._rearrange_prepare_next(outlength, parents)
File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1693, in _rearrange_prepare_next
distincts = Index64.empty(outlength * maxcount, index_nplike)
File "/public/tools/python/lib/python3.10/site-packages/awkward/index.py", line 115, in empty
return Index(nplike.empty(length, dtype=dtype), nplike=nplike)
File "/public/tools/python/lib/python3.10/site-packages/awkward/_nplikes/array_module.py", line 111, in empty
return self._module.empty(shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 204. GiB for an array with shape (27376795028,) and data type int64

During handling of the above exception, another exception occurred:

ray::find_signatures_for_chunk_wrapper() (pid=82849, ip=192.168.10.248)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 1081, in find_signatures_for_chunk_wrapper
return find_signatures_for_chunk(*params)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 1105, in find_signatures_for_chunk
sv_min_size=50 // max(1, spacing)).run(add_dummy_count_to_index)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 352, in run
self._score_signatures(add_dummy_count_to_index=add_dummy_count_to_index)
File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 334, in _score_signatures
window_scores = np.sum(scores, axis=2) # or np.min?
File "<array_function internals>", line 200, in sum
File "/public/tools/python/lib/python3.10/site-packages/awkward/highlevel.py", line 1527, in array_function
return ak._connect.numpy.array_function(
File "/public/tools/python/lib/python3.10/site-packages/awkward/_connect/numpy.py", line 102, in array_function
return function(*args, **kwargs)
File "/public/tools/python/lib/python3.10/site-packages/awkward/_connect/numpy.py", line 142, in ensure_valid_args
return function(*args, **kwargs)
File "/public/tools/python/lib/python3.10/site-packages/awkward/operations/ak_sum.py", line 298, in _nep_18_impl_sum
return sum(a, axis=axis, keepdims=keepdims)
File "/public/tools/python/lib/python3.10/site-packages/awkward/_dispatch.py", line 38, in dispatch
with OperationErrorContext(name, args, kwargs):
File "/public/tools/python/lib/python3.10/site-packages/awkward/_errors.py", line 85, in __exit__
self.handle_exception(exception_type, exception_value)
File "/public/tools/python/lib/python3.10/site-packages/awkward/_errors.py", line 95, in handle_exception
raise self.decorate_exception(cls, exception)
File "/public/tools/python/lib/python3.10/site-packages/awkward/_errors.py", line 119, in decorate_exception
new_exception = cls(self.format_exception(exception))
TypeError: _ArrayMemoryError.__init__() missing 1 required positional argument: 'dtype'
(pid=82839) /public/tools/python/lib/python3.10/site-packages/bionumpy/encodings/vcf_encoding.py:98: RuntimeWarning: invalid value encountered in cast [repeated 12x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(pid=82839) _lookup[[ord(c) for c in ('0', '1', '.')]] = np.array([0, 1, np.nan]) [repeated 12x across cluster]
```

Thanks!

Hi!

It seems the system is running out of memory: numpy.core._exceptions._ArrayMemoryError: Unable to allocate 204. GiB for an array with shape (27376795028,) and data type int64

Can I ask how many variants there are and how many individuals you have in the vcf?

Hi!

My VCF contains only structural variants, about 280,000 of them, across 27 samples; the file is about 1.8 GB.

Thanks

I've looked into this now, and I think there might be a problem with excessive memory use in part of the indexing process when some variant alleles are very large. I've pushed a fix that addresses that problem and hopefully solves it.

Could you try updating to version 2.0.3 and see if that helps? pip install kage-genotyper==2.0.3

If you still get problems with too much memory being used, I'll look further into it.

Great! Let me know if you run into any other issues while genotyping :)