VariantStreamer fails if ROI is supplied when loading the reference genome

Question

VariantStreamer fails if ROI is supplied when loading the reference genome

Closed this issue 4 years ago · 2 comments

Hi,

I aim to generate fasta sequences of variants present in a vcf, so I thought to use VariantStreamer class to do so.

To get strand information on the variant, I automatically generate bed annotations by retrieving the geneID where the variant occurs (from VEP annotations). I fetch genomic coordinates of such genes, thus my final annotations refer to full gene length intervals that for sure span the target variants that occur within genes.

Since loading the entire genome takes quite a while, I want to use the same annotations generated before as the ROI to the Bioseq class. Basically, that's the code and the error I get:

#custom methods to validate vcf and generate annotations
self.vcf = self._validate_vcf(vcf_file)
self.annotations = self.get_variant_annotations()
     
genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, roi=self.annotations)
v = dna.VariantStreamer(bioseq=genome, variants=self.vcf, binsize=5, batch_size=1, annotation=self.annotations)
it_vcf = iter(v.flow())
print(next(it_vcf))

 File "test_vcf.py", line 16, in <module>
    main()
  File "test_vcf.py", line 12, in main
    obj = VariantSeq(args.vcf, args.fasta, args.out_dir)
  File "/Users/pbarbosa/git_repos/ml_genomics/vcf/variants.py", line 47, in __init__
    print(next(it_vcf))
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/data/dna.py", line 717, in flow
    iref = self.bioseq._getsingleitem(Interval(rec.chrom, start, end)).copy()
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/data/dna.py", line 519, in _getsingleitem
    return np.asarray(self.garray[interval][:, 0, 0])
  File "/Users/pbarbosa/miniconda3/envs/ml_genomics/lib/python3.7/site-packages/janggu/data/genomicarray.py", line 274, in __getitem__
    idx = self.region2index[_iv_to_str(chrom, interval.start, interval.end)]
KeyError: 'chr1:21889497-21889502'

If I use instead genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, store_whole_genome=True), it works fine but it takes more than 20 minutes to load the genome in my laptop . Why is this happening, considering that the variant context fully overlaps the annotations?

I'd also like to pose some additional questions:

If I manage to work this issue out, what would it happen if I increase the binsize in the VariantStreamer constructor so that coordinates extend boundaries provided in ROI argument in the Bioseq class ? Would these sequences be padded in the missing nucleotides, or the sequence of the variant is simply discarded?
Is this class able to deal with Indels, or is it just SNVs?
Is there any method in janggu to convert back from one_hot arrays to fasta ?
Is it possible to use the predict_variant_effect method on models previously trained based on Keras and Tensorflow ? I mean, It would be nice to create a Janggu instance from previously serialised h5 models.

Thank you very much,
Best,
Pedro

Answer 1 · 2020-08-16T15:02:26.000Z

Hi Pedro,

yes, VEP currently requires to load the Bioseq object using store_whole_genome=True.

There are two solutions:

Load the genome with genome = Bioseq.create_from_refgenome('dna', refgenome=fasta_file, store_whole_genome=True, cache=True). This caches the Bioseq object and reloads it from the cache in the future, which is much faster.
I adapted the VariantStreamer(dna, ...) such that the first argument can be a Bioseq object or a reference genome in fasta format. This is currently on github, but will be part of the next pypi version. When supplying a reference genome as argument, the streamer doesn't load the whole genome, but only relevant sequence stretches overlapping the SNVs.

Regarding your questions:

For VEP, a context sequence of length binsize is extracted from the whole reference genome (such that the SNV is in the middle). So padding is only relevant at the beginning or the end of a chromosome. This is also why Bioseq needs to be loaded with store_whole_genome=True.
At the moment only single nucleotide variants are dealt with. Structural variants, for instance, are skipped.
Currently, there is no function to convert a one-hot encoding to the sequence.
Yes, there are several possibilities to use pre-trained models.
1. Within janggu you can create and fit a model using a specific name, save it with model.save() afterwards use model = Janggu.create_by_name('modelname') to reload the model.
2. If you want to wrap a pre-trained keras model into janggu, you could do model = Janggu(kerasmodel.inputs, kerasmodel.outputs).
3. If you can to use a pre-trained keras model for VEP, for convenience, I have also made a function predict_variant_effect. It performs the same functionality as Janggu.predict_variant_effect, but takes as first argument the keras model. This will also be part of the next version, but you could try it from github if you want.

I've also prepared another VEP tutorial notebook that makes use of these new features. See src/examples/variant_effect_prediction-part2.ipynb.

Best,
Wolfgang

Answer 2 · 2020-08-16T17:20:26.000Z

Awesome @wkopp, thanks for the very quick updates. I'll dig into this.

Best,
Pedro