chrovis/cljam

Features for validating VCF/SAM records

athos opened this issue · 2 comments

athos commented

When we write VCF/SAM records to a file and some of them contain invalid data, we often end up with a confusing error that makes it hard to tell what went wrong.

It would be nice to have easy-to-understand error messages to make such errors easier to debug, but adding those checks to the writers is highly likely to degrade serialization performance. So how about adding features, separate from the existing VCF/SAM file I/O APIs, that validate the input VCF/SAM records to see whether they conform to the format the cljam writer expects?

athos commented

For VCF, I think the validation criteria should include (but not be limited to):

  • Each required field (CHROM, POS, ID, …) has a valid value
  • Each INFO field has a valid value in terms of the ##INFO meta info
  • Each genotype field has a valid value in terms of the ##FORMAT meta info and the FORMAT field
  • If the header has at least one sample column, it also has the FORMAT column
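
A minimal sketch of how the first two criteria could be checked for a single variant, returning readable messages instead of throwing. It assumes variants are maps with `:chr`, `:pos`, `:ref` and `:info` keys and that each parsed ##INFO entry looks like `{:id "DP", :type "Integer"}`; these shapes, and the names `type-preds`, `required-errors`, `info-errors` and `validate-variant`, are illustrative assumptions rather than an existing cljam API. Only the Integer, Float and String types are covered, and Number is not checked.

```clojure
;; Assumed mapping from ##INFO "Type" values to value predicates.
;; Flag and Character are left out of this sketch, so they are not checked.
(def type-preds
  {"Integer" integer?
   "Float"   number?
   "String"  string?})

(defn required-errors
  "Checks a few of the required fields and returns a vector of messages."
  [variant]
  (let [chr (:chr variant)
        pos (:pos variant)
        ref-allele (:ref variant)]
    (cond-> []
      (not (and (string? chr) (seq chr)))
      (conj (str ":chr must be a non-empty string, got " (pr-str chr)))

      (not (and (integer? pos) (<= 0 pos)))
      (conj (str ":pos must be a non-negative integer, got " (pr-str pos)))

      (not (and (string? ref-allele) (seq ref-allele)))
      (conj (str ":ref must be a non-empty string, got " (pr-str ref-allele))))))

(defn info-errors
  "Checks each INFO value against the type declared in the ##INFO meta info."
  [meta-info variant]
  (let [decls (into {} (map (juxt #(keyword (:id %)) identity)) (:info meta-info))]
    (vec
     (for [[k v] (:info variant)
           :let [decl (get decls k)
                 ok?  (get type-preds (:type decl))
                 vs   (if (sequential? v) v [v])]
           :when (or (nil? decl)
                     (and ok? (not (every? ok? vs))))]
       (if (nil? decl)
         (str "INFO key " k " is not declared in the ##INFO meta info")
         (str "INFO key " k " expects " (:type decl) ", got " (pr-str v)))))))

(defn validate-variant
  "Returns all error messages for one variant; an empty vector means valid."
  [meta-info variant]
  (into (required-errors variant) (info-errors meta-info variant)))
```

For example, `(validate-variant {:info [{:id "DP" :type "Integer"}]} {:chr "chr1" :pos 10 :ref "A" :info {:DP "5"}})` would return a single message pointing at the `:DP` value. Genotype fields could be checked the same way against the ##FORMAT entries.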

athos commented

I think it would be convenient to have something like validate-variants in the example below, so that validation can easily be opted into when writing variants out to a file:

(with-open [r (vcf/reader "in.vcf.gz")
            w (vcf/writer "out.vcf.gz" (vcf/meta-info r) (vcf/header r))]
  (->> (vcf/read-variants r)
       (map process-variant)
       (validator/validate-variants (vcf/meta-info r) (vcf/header r))
       (vcf/write-variants w)))
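
One way such a validate-variants could behave is as a lazy pass-through: it returns the same variants unchanged and throws a descriptive `ex-info` on the first invalid one. The sketch below is only an illustration of that idea; `variant-errors` is a hypothetical stand-in for the real per-variant checks, and it only looks at `:pos` to keep the example short.

```clojure
;; Hypothetical per-variant check; a real one would cover the criteria
;; discussed in this issue. Returns a vector of error messages.
(defn variant-errors [_meta-info _header variant]
  (cond-> []
    (not (integer? (:pos variant)))
    (conj (str ":pos must be an integer, got " (pr-str (:pos variant))))))

(defn validate-variants
  "Returns a lazy seq of the given variants, throwing ex-info with a
  readable message when an invalid variant is realized."
  [meta-info header variants]
  (map-indexed
   (fn [i variant]
     (if-let [errors (seq (variant-errors meta-info header variant))]
       (throw (ex-info (str "Invalid variant at index " i ": " (first errors))
                       {:index i, :variant variant, :errors (vec errors)}))
       variant))
   variants))
```

Because the result is lazy, nothing is checked until the writer consumes the sequence, so the error would surface at the offending record while writing. Callers that skip the validation step pay no extra cost, which fits the opt-in idea.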

Or something like this (though I don't think the validator implementation would be so complicated that we'd have to manage its internal state in a separate validator instance like this):

(with-open [r (vcf/reader "in.vcf.gz")
            w (vcf/writer "out.vcf.gz" (vcf/meta-info r) (vcf/header r))]
  (let [validator (validator/make-validator (vcf/meta-info r) (vcf/header r))]
    (->> (vcf/read-variants r)
         (map process-variant)
         (validator/validate-variants validator)
         (vcf/write-variants w))))