Features for validating VCF/SAM records
athos opened this issue · 2 comments
When we write VCF/SAM records out to a file and some of them contain invalid data, we often end up with a confusing error that makes it hard to tell what went wrong.
It would be nice to have easy-to-understand error messages to make such errors easier to debug, but producing them in the writer is highly likely to degrade serialization performance. So, how about adding features for validating input VCF/SAM records to see if they conform to the format the cljam writer expects, separate from the existing VCF/SAM file IO APIs?
For VCF, I think the validation criteria should include (but not be limited to):
- Each required field (CHROM, POS, ID, …) has a valid value
- Each INFO field has a valid value in terms of the ##INFO meta info
- Each genotype field has a valid value in terms of the ##FORMAT meta info and the FORMAT field
- If the header has at least one sample column, it also has the FORMAT column
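To make the first two criteria a bit more concrete, here is a rough sketch of a per-variant check. It is not cljam code: the map shapes (a variant with :chr, :pos, :ref and :info keys, and ##INFO entries with :id and :type) are assumptions loosely modelled on cljam's parsed maps, and required-field-errors, info-errors and variant-errors are hypothetical names. Only a few fields and types are covered.

```clojure
;; Sketch only. Assumed variant shape:
;;   {:chr "chr1", :pos 100, :ref "A", :alt ["T"], :info {:DP 10}}
;; Assumed meta-info shape:
;;   {:info [{:id "DP", :number 1, :type "Integer"}]}

(def type-preds
  "Predicates for some ##INFO Type values; other types are not checked here."
  {"Integer" integer?
   "Float"   number?
   "String"  string?})

(defn required-field-errors
  "Returns a vector of error maps for a few of the required fields."
  [variant]
  (cond-> []
    (not (and (string? (:chr variant)) (seq (:chr variant))))
    (conj {:field :chr, :message "CHROM must be a non-empty string"})

    (not (nat-int? (:pos variant)))
    (conj {:field :pos, :message "POS must be a non-negative integer"})

    (not (and (string? (:ref variant))
              (re-matches #"(?i)[ACGTN]+" (:ref variant))))
    (conj {:field :ref, :message "REF must consist of A, C, G, T or N"})))

(defn info-errors
  "Checks every INFO entry of the variant against its ##INFO definition."
  [meta-info variant]
  (let [defs (into {} (map (juxt (comp keyword :id) identity)) (:info meta-info))]
    (for [[k v] (:info variant)
          :let [info-def (get defs k)
                pred     (get type-preds (:type info-def) (constantly true))
                vs       (if (sequential? v) v [v])]
          :when (or (nil? info-def) (not (every? pred vs)))]
      {:field   [:info k]
       :message (if info-def
                  (str "expected " (:type info-def) ", got " (pr-str v))
                  "no matching ##INFO definition")})))

(defn variant-errors
  "Collects all errors for one variant; an empty seq means it passed."
  [meta-info variant]
  (concat (required-field-errors variant)
          (info-errors meta-info variant)))
```

Returning error maps instead of throwing right away would let the caller decide whether to fail fast or report every problem in a record at once.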
I think it would be convenient to have something like validate-variants in the example below, so that validation can easily be opted into when writing variants out to a file:
(with-open [r (vcf/reader "in.vcf.gz")
            w (vcf/writer "out.vcf.gz" (vcf/meta-info r) (vcf/header r))]
  (->> (vcf/read-variants r)
       (map process-variant)
       ;; proposed API, does not exist yet
       (validator/validate-variants (vcf/meta-info r) (vcf/header r))
       (vcf/write-variants w)))
Or something like this (but I don't think the validator implementation would be so complicated that we have to manage the internal state as a separate validator instance like this):
(with-open [r (vcf/reader "in.vcf.gz")
            w (vcf/writer "out.vcf.gz" (vcf/meta-info r) (vcf/header r))]
  ;; proposed API, does not exist yet
  (let [validator (validator/make-validator (vcf/meta-info r) (vcf/header r))]
    (->> (vcf/read-variants r)
         (map process-variant)
         (validator/validate-variants validator)
         (vcf/write-variants w))))
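For what it's worth, the first (stateless) form could probably be a thin lazy wrapper over a per-variant check. The sketch below is only an illustration under that assumption: find-errors is a hypothetical stand-in that looks at nothing but :chr and :pos, and the variant map shape is assumed rather than taken from cljam.

```clojure
;; Sketch only: `find-errors` is a hypothetical placeholder for the real
;; per-variant checks; it ignores meta-info and header here.
(defn find-errors
  [_meta-info _header variant]
  (cond-> []
    (not (string? (:chr variant)))  (conj "CHROM must be a string")
    (not (nat-int? (:pos variant))) (conj "POS must be a non-negative integer")))

(defn validate-variants
  "Returns a lazy seq of the given variants unchanged. Realizing an invalid
  variant throws an ex-info carrying its index, the variant and its errors."
  [meta-info header variants]
  (map-indexed
   (fn [i variant]
     (if-let [errors (seq (find-errors meta-info header variant))]
       (throw (ex-info (str "Invalid variant at index " i)
                       {:index i, :variant variant, :errors (vec errors)}))
       variant))
   variants))
```

Because the result is lazy, the exception would only surface while the writer consumes the sequence, so variants that come before the invalid one may already have been written to the output file by then.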