aquaskyline/Clairvoyante

Non-model species training

andreaminio opened this issue · 7 comments

Hi,

I'd like to apply Clairvoyante to plant species for SNP and SV detection using PacBio reads. I'm trying to follow your training notebook, but I've run into some issues that I hope you can help me solve so I can set up the analysis correctly.

  1. I don't have a "true variants" VCF to compare the calls against, as I only have the samples I need to call variants for. As far as I can see, I need it to generate the files required to start the prediction (dataPrepScripts/GetTruth.py). Is there any way around this?
  2. I gather that training performs best when multiple samples and whole-genome calls are used. The notebook trains on just 2 chromosomes, I guess just to show how it is done. Are the limits on the range there for the same purpose, or is there another reason for them?
  3. How should I take care of repeats? I guess I should mask them somehow before training, but how? Is this the information contained in the chr21.bed and chr22.bed files?

Thanks in advance,

Andrea

  1. You don't need to train your own model. We provide a PacBio model trained on multiple samples at multiple depths. You can start with the model fullv3-pacbio-ngmlr-hg001+hg002+hg003+hg004-hg19.
  2. Correct.
  3. You can call variants first and later, when you know where the repetitive regions are, remove the variants called in those regions.
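
For the post-filtering in point 3, here is a minimal sketch of one way it could be done. It is not part of Clairvoyante: the function names (`load_bed`, `in_repeat`, `filter_vcf`) are made up for illustration, and it assumes the repeats are available as a BED file (0-based, half-open intervals) and the calls as a plain-text VCF (1-based POS). It only tests the POS of each record, not the full span of the REF allele, so long deletions starting just outside a repeat are kept.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-contig sorted, merged (start, end) intervals."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_repeat(repeats, chrom, pos):
    """Return True if the 1-based VCF position lies inside a repeat interval."""
    intervals = repeats.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def filter_vcf(vcf_lines, repeats):
    """Yield header lines and the records whose POS is outside all repeats."""
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if not in_repeat(repeats, chrom, int(pos)):
            yield line
```

Both `load_bed` and `filter_vcf` accept any iterable of lines, so open file handles can be passed directly and the kept lines written to a new VCF.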

So, the modelling is not dependent on the characteristics of the species? What does the model depend on, and when do you need to train a dedicated one? I thought I couldn't make use of a human-trained model on grape.
For repeats, if I don't need to do any training, I will just need to mark the variants as usual. But if I were to do the training, how would I account for them?

Sorry to keep pressing on this, but I need to understand if and how I can run your tools.

The modeling is not dependent on the characteristics of the species. The model depends only on the sequencing technology you are using. For better performance, an aligner and a genome version that match those the model was trained with are preferred.

Obviously the genome version cannot be the same, as it is another species. How much does this impact the quality of the results? And what about using ngmlr rather than minimap2 in this situation? To use minimap2 I'd have to redo the training, but should I still use human genome data for that?

A genome version difference makes a 0.3% F1-score difference on the human genome. I suggest you give the human model a try and see how it goes.
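
For reference, the F1-score mentioned above is the harmonic mean of precision and recall. A minimal sketch of the arithmetic follows; the function name and the idea of passing raw true-positive, false-positive and false-negative counts are illustrative assumptions, not Clairvoyante's own evaluation code.

```python
def f1_score(tp, fp, fn):
    """Compute F1 from counts of true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Computing these counts requires a truth set to compare against, so on a species without a "true variants" VCF this number cannot be measured directly.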

Thanks Ruibang,

I'll give it a try with the model you suggested first.