bioinformatics-centre/BayesTyper

Running BayesTyper on custom VCF file

Parsoa opened this issue · 11 comments

I've been trying to run BayesTyper on a VCF file (uploaded here which is basically the merged set of calls from HG00514 and HG00733 from here with duplicates removed). BayesTyper's clustering puts all the input variants into the Complex category although they are deletions and insertions. The genotyping stage also runs without errors but almost all the events are skipped (./.) and the remaining few are genotyped as '0/0'. Around 60% of the events here should get a 1/0 or 1/1 genotype.

Note that I have not combined the VCF file with BayesTyper's variation priors as I'm not interested in SNVs or any other variants. When I try to do that however, all of my input events are skipped and the combined set includes only events from the prior.

This is the command I use for running BayesTyper:

bayesTyper cluster -v  HG00514_HG00733.merged_nonredundant.unified.sorted.vcf -s samples.tsv -g /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_canon.fa -d /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_decoy.fa -p 16
bayesTyper genotype -v bayestyper_unit_1/variant_clusters.bin -c bayestyper_cluster_data -s samples.tsv -g /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_canon.fa -d /data/BayesTyper/bayestyper_GRCh38_bundle_v1.3/GRCh38_decoy.fa -o bayestyper_unit_1/bayestyper -z -p 4

Am I missing anything here? Thanks.

That is weird. Did you remember to convert the symbolic alleles in the input vcf to sequences before running cluster? You can do this using bayesTyperTools convertAllele.

If that is not the problem would it then be possible to share the output from BayesTyper (both stdout and vcf file)?

Thanks for your response. I didn't know about the convertAllele bit. Now after running that the events are properly categorized but the genotyping step still fails to produce meaningful genotypes for about half of the events. The VCF file with the correct alleles is here and this is the genotyping output. The stdout is below:


[21/05/2020 13:39:29] You are using BayesTyper (v1.5)

[21/05/2020 13:39:29] Seeding pseudo-random number generator with 1590093569 ...
[21/05/2020 13:39:29] Setting the kmer size to 55 ...

[21/05/2020 13:39:29] Parsed information for 1 sample(s)

[21/05/2020 13:39:29] Parsing reference genome ...
[21/05/2020 13:39:40] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[21/05/2020 13:39:40] Parsing decoy sequence(s) ...
[21/05/2020 13:39:40] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[21/05/2020 13:40:03] Setting the number of inference units to 1 across 11882 variants ...

[21/05/2020 13:40:21] Maximum resident set size: 13.5074 Gb


[21/05/2020 13:40:21] Parsing variants in unit 1 ...
[21/05/2020 13:40:26] Parsed 11882 variants as 11765 clusters

[21/05/2020 13:40:27] Finding variant cluster paths for 1 sample(s) ...
[21/05/2020 13:41:47] Counting multigroup kmers in variant cluster paths ...

[21/05/2020 13:41:58] Wrote unit 1 variant clusters to bayestyper_unit_1/variant_clusters.bin

[21/05/2020 13:41:58] Maximum resident set size: 45.7956 Gb


[21/05/2020 13:41:58] Parsed 11882 alternative alleles across 1 units and excluded:

        - Alleles on a decoy sequence: 0
        - Alleles on a chromosome not in the reference genome: 0
        - Alleles with a reference not equal to the reference genome sequence: 0
        - Alleles within 55 bases of a chromosome end: 0
        - Alleles longer than 500000 bases: 0

[21/05/2020 13:41:58] Out of 11882 variants:

        - Single nucleotide variant: 0
        - Insertion: 7059
        - Deletion: 4823
        - Complex: 0
        - Mixture: 0
        - Unsupported (excluded): 0

[21/05/2020 13:41:58] Variants merged into 11765 clusters and further into 11736 groups

[21/05/2020 13:41:58] Maximum resident set size: 45.7956 Gb


[21/05/2020 13:41:58] Writing inter-cluster regions ...
[21/05/2020 13:41:58] Wrote 14336 regions to bayestyper_cluster_data/intercluster_regions.txt.gz

[21/05/2020 13:42:00] Counting parameter kmers in inter-cluster regions ...
[21/05/2020 13:47:39] Wrote 1000000 kmers to bayestyper_cluster_data/parameter_kmers.fa.gz

[21/05/2020 13:47:39] Creating multigroup kmers bloom filter ...
[21/05/2020 13:47:41] Wrote 233759 kmers to bayestyper_cluster_data/multigroup_kmers.bloom[Meta|Data]

[21/05/2020 13:47:41] Maximum resident set size: 45.7956 Gb


[21/05/2020 13:47:41] BayesTyper cluster completed succesfully!


[21/05/2020 13:47:47] You are using BayesTyper (v1.5)

[21/05/2020 13:47:47] Seeding pseudo-random number generator with 1590094067 ...
[21/05/2020 13:47:47] Setting the kmer size to 55 ...

[21/05/2020 13:47:47] Parsed information for 1 sample(s)

[21/05/2020 13:47:47] Parsing reference genome ...
[21/05/2020 13:47:59] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[21/05/2020 13:47:59] Parsing decoy sequence(s) ...
[21/05/2020 13:47:59] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[21/05/2020 13:48:10] Maximum resident set size: 3.28414 Gb


[21/05/2020 13:48:10] Parsing variant clusters ...
[21/05/2020 13:48:11] Parsed 11765 variant clusters (11882 variants)

[21/05/2020 13:48:13] Parsing parameter kmers ...
[21/05/2020 13:48:15] Parsed 1000000 kmers

[21/05/2020 13:48:15] Maximum resident set size: 5.07988 Gb


[21/05/2020 13:48:15] Counting kmers in variant cluster paths ...
[21/05/2020 13:48:23] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[21/05/2020 14:02:25] Parsing KMC table containing 18378611763 kmers for sample NA19240 ...

[21/05/2020 15:28:23] Classifying kmers in variant cluster paths ...
[21/05/2020 15:28:32] Out of 8329103 kmers:

        - 5915592 have a match to a single variant cluster
        - 32408 have a match to single variant cluster group and multiple variant clusters

        - 863414 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
        - 29212 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
        - 180770 have matches to multiple variant cluster groups within or across inference units (not used for inference)

        - 1307707 have no match to a variant cluster (includes parameter kmers)

[21/05/2020 15:28:32] Maximum resident set size: 5.85634 Gb


[21/05/2020 15:28:32] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...

[21/05/2020 15:28:32] Estimated negative binomial (mean = 18.473, var = 148.322) for sample NA19240 using 930793 parameter kmers (multiplicity = 2)

[21/05/2020 15:28:33] Wrote genomic parameters to bayestyper_unit_1/bayestyper_genomic_parameters.txt

[21/05/2020 15:28:33] Maximum resident set size: 5.85634 Gb


[21/05/2020 15:28:33] Estimating noise model parameters using 20 independent gibbs sampling chains each with 350 iterations (100 burn-in) ...
[21/05/2020 15:40:37] Calculated final noise model parameters by averaging 5000 parameter estimates (250 per gibbs sampling chain)

WARNING: Low number of variants used for Poisson parameter estimation (11831 < 100000)
WARNING: Try using the noise genotyping mode (--noise-genotyping) if noise rate estimates are not stable between chains

[21/05/2020 15:40:37] Wrote noise parameters to bayestyper_unit_1/bayestyper_noise_parameters.txt

[21/05/2020 15:40:37] Maximum resident set size: 6.4997 Gb


[21/05/2020 15:40:37] Estimating genotypes using 20 independent gibbs sampling chains each with 350 iterations (100 burn-in) ...


[21/05/2020 15:41:33] Sorting genotyped variants ...
[21/05/2020 15:41:35] Wrote genotyped variants to bayestyper_unit_1/bayestyper.vcf.gz

[21/05/2020 15:41:35] Out of 11882 variants:

        - 11882 were genotyped
        - 0 were skipped (unsupported)

[21/05/2020 15:41:35] Maximum resident set size: 6.4997 Gb


[21/05/2020 15:41:35] BayesTyper genotype completed succesfully!

It looks like you have a lot of zero-inflation in your k-mer counts (see the high estimated variance). This also results in a lot of the genotypes being filtered due to a high number of zero count k-mers across the variant (format attribute SAF=2). I am not sure why that is the case. Does the input reads have a high error rate? Or are you using a non-random subset of the reads/kmer? It could also be that you are missing smaller variants around or in the SVs, but I would not expect that to be that severe.

You could try to remove the filter using the --disable-observed-kmers option. That should solve the filtering problem, but will likely result in more incorrect genotypes.

Also, given the low number of input variant I would recommend running with the option --noise-genotyping. That enables joint inference of noise model parameters and genotypes.

Please let me know if you run into any other problems.

Come to think about it. Incorrect SV breakpoints could also result in inflated zero count kmers. Do you see that more deletions are filtered compared to insertions? However, that would only explain the filtering and not the really high kmer count variance.

I'm not doing any filtering on the reads or kmers, however I generally don't expect these breakpoints to be exact to the basepair.

I tried rerunning with the above flags and now much more gets genotyped. Around 2800 (1000 deletions, 1800 insertions) are still missing genotypes, but accuracy among those that have been genotyped is somewhat reasonable (~83%).

BayesTyper does not perform well when the breakpoints are not exact due to its use of exact matching kmers. If you can you share the latest results then I will take a look the remaining filtered variants.

Thanks for your time! The new results are here.

Btw I combined the previous events with the prior events and genotyped the whole set, but now I'm only getting ~1500 of my events in the final genotype set (based on ACO value or bedtools intersect with my original coordinates). And the breakpoints seem to have been adjusted for most of the events. How come is it that I can't find all of my events although none were skipped as per the genotyping stage's output? (the combined set before genotyping had all of the ~12k original events) Thanks.

Thanks, I am happy to help. Would it be possible to also share the log files (stdout) from BayesTyper?

Here is the stdout for the clustering and genotyping stages:

[01/06/2020 01:21:38] You are using BayesTyper (v1.5)

[01/06/2020 01:21:38] Seeding pseudo-random number generator with 1590999698 ...
[01/06/2020 01:21:38] Setting the kmer size to 55 ...

[01/06/2020 01:21:38] Parsed information for 1 sample(s)

[01/06/2020 01:21:38] Parsing reference genome ...
[01/06/2020 01:21:50] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[01/06/2020 01:21:50] Parsing decoy sequence(s) ...
[01/06/2020 01:21:50] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[01/06/2020 01:22:08] Setting the number of inference units to 9 across 49591714 variants ...

[01/06/2020 01:22:23] Maximum resident set size: 13.5074 Gb


[01/06/2020 01:22:23] Parsing variants in unit 1 ...
[01/06/2020 01:24:32] Parsed 5510191 variants as 2029456 clusters

[01/06/2020 01:25:04] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 01:31:59] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 01:42:45] Wrote unit 1 variant clusters to bayestyper_unit_1/variant_clusters.bin

[01/06/2020 01:42:59] Maximum resident set size: 54.1727 Gb


[01/06/2020 01:42:59] Parsing variants in unit 2 ...
[01/06/2020 01:45:11] Parsed 5510193 variants as 2070888 clusters

[01/06/2020 01:45:44] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 01:53:43] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 02:20:06] Wrote unit 2 variant clusters to bayestyper_unit_2/variant_clusters.bin

[01/06/2020 02:20:22] Maximum resident set size: 54.6393 Gb


[01/06/2020 02:20:22] Parsing variants in unit 3 ...
[01/06/2020 02:23:37] Parsed 5510191 variants as 1991158 clusters

[01/06/2020 02:24:08] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 02:33:00] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 02:45:20] Wrote unit 3 variant clusters to bayestyper_unit_3/variant_clusters.bin

[01/06/2020 02:45:40] Maximum resident set size: 54.8633 Gb


[01/06/2020 02:45:40] Parsing variants in unit 4 ...
[01/06/2020 02:49:34] Parsed 5510494 variants as 1954976 clusters

[01/06/2020 02:50:04] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 02:57:24] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 03:12:49] Wrote unit 4 variant clusters to bayestyper_unit_4/variant_clusters.bin

[01/06/2020 03:13:02] Maximum resident set size: 55.0972 Gb


[01/06/2020 03:13:02] Parsing variants in unit 5 ...
[01/06/2020 03:16:46] Parsed 5510193 variants as 1944057 clusters

[01/06/2020 03:17:14] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 03:25:42] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 03:46:36] Wrote unit 5 variant clusters to bayestyper_unit_5/variant_clusters.bin

[01/06/2020 03:46:48] Maximum resident set size: 55.576 Gb


[01/06/2020 03:46:48] Parsing variants in unit 6 ...
[01/06/2020 03:50:13] Parsed 5510190 variants as 1962515 clusters

[01/06/2020 03:50:42] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 03:57:16] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 04:12:02] Wrote unit 6 variant clusters to bayestyper_unit_6/variant_clusters.bin

[01/06/2020 04:12:14] Maximum resident set size: 55.6417 Gb


[01/06/2020 04:12:14] Parsing variants in unit 7 ...
[01/06/2020 04:15:55] Parsed 5510206 variants as 2004681 clusters

[01/06/2020 04:16:24] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 04:24:16] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 04:39:08] Wrote unit 7 variant clusters to bayestyper_unit_7/variant_clusters.bin

[01/06/2020 04:39:20] Maximum resident set size: 55.7417 Gb


[01/06/2020 04:39:20] Parsing variants in unit 8 ...
[01/06/2020 04:43:18] Parsed 5510190 variants as 1898938 clusters

[01/06/2020 04:43:45] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 04:53:40] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 05:07:09] Wrote unit 8 variant clusters to bayestyper_unit_8/variant_clusters.bin

[01/06/2020 05:07:20] Maximum resident set size: 55.9301 Gb


[01/06/2020 05:07:20] Parsing variants in unit 9 ...
[01/06/2020 05:10:50] Parsed 5509866 variants as 2078527 clusters

[01/06/2020 05:11:23] Finding variant cluster paths for 1 sample(s) ...
[01/06/2020 05:19:13] Counting multigroup kmers in variant cluster paths ...

[01/06/2020 05:37:51] Wrote unit 9 variant clusters to bayestyper_unit_9/variant_clusters.bin

[01/06/2020 05:38:03] Maximum resident set size: 56.1484 Gb


[01/06/2020 05:38:03] Parsed 55636763 alternative alleles across 9 units and excluded:

        - Alleles on a decoy sequence: 51
        - Alleles on a chromosome not in the reference genome: 0
        - Alleles with a reference not equal to the reference genome sequence: 0
        - Alleles within 55 bases of a chromosome end: 0
        - Alleles longer than 500000 bases: 65

[01/06/2020 05:38:03] Out of 49591714 variants:

        - Single nucleotide variant: 33294776
        - Insertion: 4950672
        - Deletion: 9284381
        - Complex: 252760
        - Mixture: 1809037
        - Unsupported (excluded): 88

[01/06/2020 05:38:03] Variants merged into 17935196 clusters and further into 13689200 groups

[28/05/2020 13:15:55] You are using BayesTyper (v1.5)

[28/05/2020 13:15:55] Seeding pseudo-random number generator with 1590696955 ...
[28/05/2020 13:15:55] Setting the kmer size to 55 ...

[28/05/2020 13:15:55] Parsed information for 1 sample(s)

[28/05/2020 13:15:55] Parsing reference genome ...
[28/05/2020 13:16:08] Parsed 65 reference genome chromosomes(s) (3095211400 nucleotides)

[28/05/2020 13:16:08] Parsing decoy sequence(s) ...
[28/05/2020 13:16:08] Parsed 2515 decoy sequence(s) (10503663 nucleotides)

[28/05/2020 13:16:20] Maximum resident set size: 3.28397 Gb


[28/05/2020 13:16:20] Parsing variant clusters ...
[28/05/2020 13:17:22] Parsed 2029456 variant clusters (5510191 variants)

[28/05/2020 13:17:38] Parsing parameter kmers ...
[28/05/2020 13:17:44] Parsed 1000000 kmers

[28/05/2020 13:17:44] Maximum resident set size: 27.8975 Gb


[28/05/2020 13:17:44] Counting kmers in variant cluster paths ...
[28/05/2020 13:39:07] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[28/05/2020 13:46:59] Parsing KMC table containing 18378611763 kmers for sample NA19240 ...

[28/05/2020 15:37:55] Classifying kmers in variant cluster paths ...
[28/05/2020 16:01:30] Out of 239200347 kmers:

	- 203827734 have a match to a single variant cluster
	- 19773651 have a match to single variant cluster group and multiple variant clusters

	- 335313 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
	- 19907 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
	- 13007289 have matches to multiple variant cluster groups within or across inference units (not used for inference)

	- 2236453 have no match to a variant cluster (includes parameter kmers)

[28/05/2020 16:01:30] Maximum resident set size: 28.8729 Gb


[28/05/2020 16:01:30] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...

[28/05/2020 16:01:30] Estimated negative binomial (mean = 19.9629, var = 113.103) for sample NA19240 using 942301 parameter kmers (multiplicity = 2)

[28/05/2020 16:01:32] Wrote genomic parameters to bayestyper_unit_1/bayestyper_genomic_parameters.txt

[28/05/2020 16:01:32] Maximum resident set size: 28.8729 Gb


[28/05/2020 16:01:32] Estimating noise model parameters and genotypes using 20 independent gibbs sampling chains each with 350 iterations (100 burn-in) ...

[28/05/2020 17:32:10] Finished 1 gibbs sampling chain
[28/05/2020 18:43:49] Finished 2 gibbs sampling chains
[28/05/2020 19:56:28] Finished 3 gibbs sampling chains
[28/05/2020 21:10:09] Finished 4 gibbs sampling chains
[28/05/2020 22:34:46] Finished 5 gibbs sampling chains
[28/05/2020 23:58:03] Finished 6 gibbs sampling chains
[29/05/2020 01:11:07] Finished 7 gibbs sampling chains
[29/05/2020 02:24:17] Finished 8 gibbs sampling chains
[29/05/2020 03:37:27] Finished 9 gibbs sampling chains
[29/05/2020 04:51:51] Finished 10 gibbs sampling chains
[29/05/2020 06:06:45] Finished 11 gibbs sampling chains
[29/05/2020 07:21:48] Finished 12 gibbs sampling chains
[29/05/2020 08:36:01] Finished 13 gibbs sampling chains
[29/05/2020 09:51:05] Finished 14 gibbs sampling chains
[29/05/2020 11:05:14] Finished 15 gibbs sampling chains
[29/05/2020 12:20:06] Finished 16 gibbs sampling chains
[29/05/2020 13:34:35] Finished 17 gibbs sampling chains
[29/05/2020 14:49:31] Finished 18 gibbs sampling chains
[29/05/2020 16:04:00] Finished 19 gibbs sampling chains
[29/05/2020 17:18:13] Finished 20 gibbs sampling chains

[29/05/2020 17:18:13] Wrote noise parameters to bayestyper_unit_1/bayestyper_noise_parameters.txt

[29/05/2020 17:18:13] Collecting genotypes ...

[29/05/2020 17:29:43] Sorting genotyped variants ...
[29/05/2020 17:31:14] Wrote genotyped variants to bayestyper_unit_1/bayestyper.vcf.gz

[29/05/2020 17:31:15] Out of 5510191 variants:

	- 5510184 were genotyped
	- 7 were skipped (unsupported)

[29/05/2020 17:31:15] Maximum resident set size: 142.625 Gb


[29/05/2020 17:31:15] BayesTyper genotype completed succesfully!

Thanks again.

BayesTyper does not perform well when the breakpoints are not exact due to its use of exact matching kmers. If you can you share the latest results then I will take a look the remaining filtered variants.

Hi all,

I met the same problem that most events are skipped (./.). I do not think we can get the exact breakpoint for most of SVs by only use the short reads sequencing. If this problem can be solved, it will be highly welcome.

Sincerely,
Zhuqing

@Parsoa I have looked at the results and most of the remaining filtered genotypes are due to a low genotype posterior probability (GPP). By default BayesTyper filters all genotypes with a posterior below 0.99. You could try to decrease this threshold if you are interested in the remaining genotypes, but be aware that the results will be more uncertain. You can either rerun bayesTyper genotype or run bayesTyperTools filter to re-filter the genotypes (https://github.com/bioinformatics-centre/BayesTyper/wiki/Filtering). The remaining filtered genotypes are due to insufficient unique kmers resulting from repetitive regions. These there is not much to do about.

Regarding the combined set. bayesTyperTools combine tries to simplify and remove redundant variants when combining multiple sets. Basically if a variant in one set can be explained by two (or more) variants in another set it will be filtered out and the its ACO value added to the two variants.