Develop a VEP for non-human organisms
Thanks for developing the application! GPN-MSA seems to be a great approach for predicting variant effects, especially for variants outside protein-coding regions. I have a quick question about applying it to non-human organisms: as I understand the process, to build a VEP for a non-human organism (our target organism), I need to use an MSA dataset for the target organism to re-train a model, then run the "VEP" process on that model. Is this the correct way to do it?
Hello, thanks for your interest! That would be the right approach. The challenge is getting the MSA. The current code starts from an alignment in MAF format referenced to the target organism. Such alignments are available for many organisms in the UCSC Genome Browser downloads (see "Multiple alignments").
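To make "MAF referenced to the target organism" concrete, here is a minimal sketch of reading MAF blocks and projecting each block onto the reference coordinates. It is not the GPN-MSA preprocessing code. It assumes the usual UCSC convention that the first "s" row of each block is the reference, and the names `MafRow`, `parse_maf`, and `reference_columns` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class MafRow(NamedTuple):
    src: str       # e.g. "species.chrom"
    start: int     # 0-based start in the source sequence
    size: int      # number of non-gap bases in this row
    strand: str
    src_size: int  # total length of the source sequence
    text: str      # aligned bases, with "-" for gaps


def parse_maf(lines: Iterable[str]) -> Iterator[List[MafRow]]:
    """Yield one list of rows per alignment block ("a" line plus its "s" lines)."""
    block: List[MafRow] = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new "a" line closes the current block.
            if block:
                yield block
                block = []
        elif fields[0] == "s":
            _, src, start, size, strand, src_size, text = fields[:7]
            block.append(MafRow(src, int(start), int(size), strand, int(src_size), text))
        # Other line types ("#", "i", "e", "q") are ignored in this sketch.
    if block:
        yield block


def reference_columns(block: List[MafRow]) -> Tuple[int, Dict[str, str]]:
    """Drop columns where the reference (assumed to be row 0) has a gap.

    Returns the reference start and, per species, one character per
    reference position covered by the block.
    """
    ref = block[0]
    keep = [i for i, base in enumerate(ref.text) if base != "-"]
    projected = {
        row.src.split(".")[0]: "".join(row.text[i] for i in keep)
        for row in block
    }
    return ref.start, projected


if __name__ == "__main__":
    toy = """##maf version=1
a score=10
s target.chr1 10 4 + 100 AC-GT
s other.chr2  20 4 + 200 ACTG-
"""
    for blk in parse_maf(toy.splitlines()):
        print(reference_columns(blk))
```

Because every block is keyed on the reference row's coordinates, the same file cannot be reused directly with a different species as the reference.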
One annoying aspect of the MAF format is that it is referenced to a single species, so you can't repurpose the same file for other species. HAL format does not have this limitation: from a HAL file you can generate a MAF referenced to any of its species, though this can be slow.
Our current code for processing alignments (MAF->Zarr) is not optimal. If I were starting again, I'd take a look at https://github.com/ComparativeGenomicsToolkit/taffy/ as an alternative. In the end, what you need for training is random access to windows of the genome, ideally from multiple threads at the same time.
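As a rough illustration of that last requirement, here is a sketch of random-access window reads from several threads. It does not use Zarr or taffy. It assumes a hypothetical flat binary layout with one byte per species per reference position, read through Python's standard `mmap` module, and all constants and function names are made up for the example.

```python
import mmap
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

N_SPECIES = 4        # hypothetical number of aligned species
GENOME_LEN = 10_000  # hypothetical number of reference positions
WINDOW = 128         # hypothetical training window length


def write_toy_alignment(path):
    """Write a toy position-major alignment: N_SPECIES bytes per reference position."""
    rng = random.Random(0)
    with open(path, "wb") as f:
        f.write(bytes(rng.choice(b"ACGT-") for _ in range(GENOME_LEN * N_SPECIES)))


def read_window(mm, start, length=WINDOW):
    """Return rows [start, start + length), one N_SPECIES-byte string per position."""
    lo = start * N_SPECIES
    hi = (start + length) * N_SPECIES
    block = mm[lo:hi]  # slicing copies only the requested bytes, no shared file pointer
    return [block[i:i + N_SPECIES] for i in range(0, len(block), N_SPECIES)]


def sample_windows(path, n_windows, n_threads=4, seed=0):
    """Read n_windows randomly placed windows concurrently from a read-only mmap."""
    rng = random.Random(seed)
    starts = [rng.randrange(0, GENOME_LEN - WINDOW + 1) for _ in range(n_windows)]
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            windows = list(pool.map(lambda s: read_window(mm, s), starts))
    return starts, windows


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy_alignment.bin")
    write_toy_alignment(path)
    starts, windows = sample_windows(path, n_windows=8)
    print(starts[0], windows[0][:3])
```

The fixed-width layout is what makes a window's byte range computable from its start coordinate alone. A chunked array store offers the same property with compression, at the cost of decoding whole chunks per read.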