feat: encode TRs in PLINK2 PGEN files
aryarm opened this issue · 1 comments
Feature request
Can we have a command or Python function to help encode tandem repeats (TRs) within PLINK2 PGEN files?
Motivation
The PLINK2 PGEN format can encode multi-allelic variants, so in theory, it should be possible to store TRs in the PVAR/PSAM/PGEN formats. In addition to hard calls, PGEN files can also encode dosage information, so it should be possible to store TR allele lengths in the file, as well.
This would allow TRs to be used within many PLINK2 commands and make it much easier to use TRs within any large-scale association analyses. PGEN files are also very space efficient and much quicker to work with than VCFs.
Implementation
Unfortunately, the pgenlib Python library capable of creating PGEN files does not yet have support for multi-allelic variants, but there are plans for it. Quoting from the python_api.txt documentation:
Multiallelic variants aren't fully supported yet. Instead, all ALT alleles are effectively collapsed into one.
But perhaps there's a way for TRTools to create VCFs that could be converted into PGEN files via a command like this?
plink2 --vcf trtools.vcf 'dosage=RUC' --out converted
(Note that I used RUC as the name of the field in the VCF that contains a count of the number of repeat units, since this will be standardized in the new VCF 4.4 discussed here)
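If we went the "python function" route, a thin wrapper around that conversion command might look roughly like the sketch below. The function names are hypothetical (nothing like this exists in TRTools yet), the dosage field name is left as a parameter, and I've added --make-pgen on the assumption that plink2 needs an explicit request to write the PGEN/PVAR/PSAM files. Whether plink2 would accept repeat counts as dosage values is not something this sketch checks.

```python
import subprocess


def build_plink2_import_cmd(vcf_path, dosage_field, out_prefix):
    """Build the plink2 argument list for importing a VCF with dosages.

    Hypothetical helper: mirrors the command above, plus --make-pgen.
    """
    return [
        "plink2",
        "--vcf", vcf_path, f"dosage={dosage_field}",
        "--make-pgen",  # assumption: explicitly ask for .pgen/.pvar/.psam output
        "--out", out_prefix,
    ]


def convert_vcf_to_pgen(vcf_path, dosage_field, out_prefix):
    """Run the conversion; raises CalledProcessError if plink2 fails."""
    subprocess.run(
        build_plink2_import_cmd(vcf_path, dosage_field, out_prefix),
        check=True,
    )


# Building (not running) the command for the example above
cmd = build_plink2_import_cmd("trtools.vcf", "RUC", "converted")
```

Keeping the command construction separate from the subprocess call means the arguments can be inspected or logged without plink2 being installed.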
Update (7/13): Actually, we should probably avoid using RUC as the name, if we want to do this. RUC is an INFO field whereas I think plink2 imports dosage from a FORMAT field, instead.
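To make the INFO vs FORMAT distinction concrete, here is a small sketch that checks how a field is declared in a VCF header. The helper name and the header lines are made up for illustration (they are not copied from a real TRTools VCF or from the VCF spec), and it only looks at header declarations, not at the records themselves.

```python
import re

# Matches header declarations like ##INFO=<ID=XYZ,...> or ##FORMAT=<ID=XYZ,...>
_DECL_RE = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def field_kinds(header_lines, field_id):
    """Return the set of kinds ("INFO", "FORMAT") under which field_id is declared."""
    kinds = set()
    for line in header_lines:
        match = _DECL_RE.match(line)
        if match and match.group(2) == field_id:
            kinds.add(match.group(1))
    return kinds


# Illustrative header only; the Number/Type/Description values are assumptions
example_header = [
    "##fileformat=VCFv4.4",
    '##INFO=<ID=RUC,Number=A,Type=Float,Description="Repeat unit count (illustrative)">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
]

ruc_kinds = field_kinds(example_header, "RUC")
```

With a header like this one, RUC shows up only as an INFO field, so a per-sample dosage would need its own FORMAT field under a different name.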
Closing this because it is being handled by CAST-genomics/haptools#73 (comment) instead.