aquaskyline/Clairvoyante

High runtime, high memory, low precision and recall using NA12878 data

pjedge opened this issue · 2 comments

I am using the pre-computed model for the AJ son (HG002) to call variants with the 44x PacBio reads for NA12878 from GIAB (sorted_final_merged.bam). An example command (for chr5) is as follows:

clairvoyante.py callVarBam \
    --chkpnt_fn trainedModels/fullv3-pacbio-ngmlr-hg002-hg19/learningRate1e-3.epoch999 \
    --ref_fn data/genomes/hg19.fa \
    --bam_fn data/NA12878.1000g/aligned_reads/pacbio/pacbio.blasr.all.44x.bam \
    --ctgName chr5 \
    --call_fn extra_data/NA12878.1000g/variants/clairvoyante.pacbio.blasr.44x.unfiltered/5.vcf.tmp \
    --sampleName NA12878 \
    --threshold 0.2 \
    --minCoverage 4 \
    --threads 4

This took 107 CPU-hours (~40 wall-hours) and 28 GB of memory using 4 CPU cores, which suggests that something is wrong.
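As a rough sanity check on those runtime figures, dividing CPU-hours by wall-hours gives the average number of cores that were busy. This is only a sketch: the function name `average_cores_busy` is made up for illustration, and the wall-clock figure is approximate, so the result is approximate too.

```python
def average_cores_busy(cpu_hours, wall_hours):
    """Average number of cores kept busy over the run (hypothetical helper)."""
    return cpu_hours / wall_hours

# Figures reported above; the wall-clock time is approximate (~40 hours).
cpu_hours = 107.0
wall_hours = 40.0
threads = 4  # value passed to --threads

busy = average_cores_busy(cpu_hours, wall_hours)
utilization = busy / threads  # fraction of the requested threads in use

print(f"{busy:.2f} cores busy on average ({utilization:.0%} of {threads} threads)")
```

So the job kept roughly 2.7 of the 4 requested cores busy on average.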
The results also look poor: on the chromosomes that complete successfully, I'm observing very low precision and recall, which suggests random or otherwise bad program output. For example, chromosome 5 has precision=0.5069 and recall=0.4438 at GQ=50 (calculated using rtg vcfeval against GIAB truth variants inside confident regions).
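For reference, here is a minimal sketch of how precision, recall, and their harmonic mean (F1) relate. The helper names and the true-positive/false-positive/false-negative counts are hypothetical and only for illustration; they are not taken from the vcfeval output. Only the two chr5 values at the end are the ones reported above.

```python
def precision(tp, fp):
    """Fraction of called variants that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of truth variants that were called."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts, only to show the definitions.
example_p = precision(tp=80, fp=20)
example_r = recall(tp=80, fn=120)

# Values reported above for chr5 at GQ=50.
chr5_f1 = f1_score(0.5069, 0.4438)
print(f"chr5 F1 at GQ=50: {chr5_f1:.4f}")
```

Summarizing both numbers as a single F1 makes it easier to compare runs, for example across different aligners or GQ thresholds.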

Do you know what might be going wrong here?
EDIT: This BAM was aligned with BLASR. Is Clairvoyante very sensitive to using a different aligner for training vs. testing?

Is the BLASR alignment a public dataset that we can download and check?

It is -- it's available here:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/sorted_final_merged.bam
If you could test it out that'd be great!