elimuinformatics/vcf2fhir

Conversion not returning all variant entries

clake-deloitte opened this issue · 4 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest version
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed

Context

Hello, I have been using vcf2fhir on my test VCFs, and I have noticed that no INDEL variants are included in the JSON output after conversion. Quite a few entries other than INDELs are missing as well, and I am not sure why. Is there a setting that makes all of the variant entries convert, regardless of type? I don't believe any entry is missing information.

Expected Behavior

I am expecting all of the variant entries to be included in the final JSON output.

Current Behavior

Only a subset of the variants are included in the JSON output following the conversion.

Steps to Reproduce

The code I am using for the conversion is as follows:

import vcf2fhir
vcf_fhir_converter = vcf2fhir.Converter('/test_1000.vcf', ref_build='GRCh37', genomic_source_class='mixed', patient_id='patient_ID')
vcf_fhir_converter.convert(output_filename='/test_1000.json')

Failure Logs

I am unable to attach VCF or JSON files, but I would be more than happy to send them via email if you'd like to see them.

Hi @clake-deloitte, generally this happens because the VCF rows in question meet one of the exclusion criteria (described here). A good way to see why a given row isn't converting is to enable and check the invalid record log (described here). Can you give that a try?
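For reference, here is a minimal sketch of enabling that log with Python's standard `logging` module. The logger name `vcf2fhir.invalidrecord`, the DEBUG level, and the message format are inferred from the log line quoted in the reply below rather than taken from the documentation, so treat them as assumptions. The helper name `enable_invalid_record_log` and the default file name are hypothetical.

```python
import logging


def enable_invalid_record_log(path="invalidrecord.log"):
    """Route messages from the (assumed) invalid-record logger to a file."""
    # Assumption: vcf2fhir reports skipped rows on this logger at DEBUG level.
    logger = logging.getLogger("vcf2fhir.invalidrecord")
    logger.setLevel(logging.DEBUG)

    handler = logging.FileHandler(path)
    handler.setLevel(logging.DEBUG)
    # Assumed format, matching "timestamp - logger - level - message".
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return handler
```

Calling `enable_invalid_record_log()` before constructing the `Converter` and running `convert()` should leave one line per skipped row in the chosen file, each with a `Reason:` explaining the exclusion.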

Hi, yes this is actually very helpful. However, I'd like to have all the variants kept - and it seems that all of the variant entries have the same error (which is why they're being dropped):

2023-05-02 16:21:28,091 - vcf2fhir.invalidrecord - DEBUG - Reason: VCF FORMAT.GT is in ['0/0','0|0','0'], Record: Record(CHROM=1, POS=55416, REF=G, ALT=[A]), considered sample: CallData(GT=0|0, DS=0.05, GL=[-0.48, -0.48, -0.48])

Is there any way to avoid dropping these variants, short of modifying the VCFs directly? Thanks!

Hi @clake-deloitte, unfortunately there is no way to keep these variants without a code change (or changing the VCFs, as you mention). If you do edit the code, there may be other consequences that need to be tested. For instance, calculating allelic state relies on genotype.
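To gauge how many rows are affected before deciding on either option, a small standalone check can list the rows whose genotype matches the reason in the log above (`'0/0'`, `'0|0'`, or `'0'`). This is only a sketch and does not use vcf2fhir itself. It assumes a tab-delimited, uncompressed VCF and that the first sample column is the one being considered. The function name `find_hom_ref_rows` is hypothetical, and the second data row in the usage example is made up for illustration.

```python
# Genotypes named in the invalid-record log reason.
HOM_REF_GENOTYPES = {"0/0", "0|0", "0"}


def find_hom_ref_rows(vcf_lines, sample_index=0):
    """Return (CHROM, POS, GT) for rows whose sample genotype is homozygous reference."""
    matches = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= 9 + sample_index:
            continue  # no FORMAT column or no such sample column
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        sample_values = fields[9 + sample_index].split(":")
        gt = sample_values[format_keys.index("GT")]
        if gt in HOM_REF_GENOTYPES:
            matches.append((fields[0], int(fields[1]), gt))
    return matches


# Usage: the first row mirrors the record in the log; the second is illustrative.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "1\t55416\t.\tG\tA\t.\tPASS\t.\tGT:DS\t0|0:0.05",
    "1\t55500\t.\tC\tT\t.\tPASS\t.\tGT:DS\t0|1:1.0",
]
print(find_hom_ref_rows(example))  # [('1', 55416, '0|0')]
```

For a real file, passing an open file handle (for example `open('/test_1000.vcf')`) in place of `example` works the same way, since the function only iterates over lines.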

Ok, no problem, thank you for the information!