gymrek-lab/TRTools

feat: encode TRs in PLINK2 PGEN files

aryarm opened this issue · 1 comments

Feature request

Can we have a command or Python function to help encode TRs within PLINK2 PGEN files?

Motivation

The PLINK2 PGEN format can encode multi-allelic variants, so in theory, it should be possible to store TRs in the PVAR/PSAM/PGEN formats. In addition to hard calls, PGEN files can also encode dosage information, so it should be possible to store TR allele lengths in the file, as well.

This would allow TRs to be used within many PLINK2 commands and make it much easier to use TRs within any large-scale association analyses. PGEN files are also very space efficient and much quicker to work with than VCFs.

Implementation

Unfortunately, the pgenlib Python library capable of creating PGEN files does not yet have support for multi-allelic variants, but there are plans for it. Quoting from the python_api.txt documentation:

Multiallelic variants aren't fully supported yet. Instead, all ALT alleles are effectively collapsed into one.

But perhaps there's a way for TRTools to create VCFs that could be converted into PGEN files via a command like this?

plink2 --vcf trtools.vcf 'dosage=RUC' --out converted

(Note that I used RUC as the name of the field in the VCF that contains a count of the number of repeat units, since this will be standardized in the new VCF 4.4 discussed here.)
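For scripting the conversion, here is a minimal sketch of how the command above could be assembled from Python. The helper name `build_plink2_import_cmd` is hypothetical (it is not part of TRTools or plink2), and the sketch assumes `plink2` is on the PATH and that the VCF actually carries the named field.

```python
import shlex


def build_plink2_import_cmd(vcf_path, out_prefix, dosage_field):
    """Build the argv list for importing a VCF with plink2.

    Hypothetical helper: it only mirrors the command shown above,
    i.e. `plink2 --vcf <vcf> dosage=<field> --out <prefix>`.
    """
    if not dosage_field or any(c.isspace() for c in dosage_field):
        raise ValueError("dosage_field must be a non-empty name without whitespace")
    return [
        "plink2",
        "--vcf", vcf_path, f"dosage={dosage_field}",
        "--out", out_prefix,
    ]


cmd = build_plink2_import_cmd("trtools.vcf", "converted", "RUC")
# Print a copy-pasteable version of the command; to actually run it,
# pass `cmd` to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Keeping the field name as a parameter means the same helper still works if a name other than RUC ends up being used.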

Update (7/13): Actually, we should probably avoid using RUC as the name if we want to do this. RUC is an INFO field, whereas I think plink2 imports dosage from a FORMAT field.
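To illustrate what a per-sample FORMAT value might look like, here is a rough sketch that turns per-allele repeat-unit counts into one number per genotype. Everything in it is an assumption for illustration: the function names are hypothetical, the per-allele counts are assumed to be available as a list indexed by allele (REF first), and summing the two alleles is just one possible encoding. Whether plink2 would accept such values as a dosage without rescaling is not addressed here.

```python
def summed_repeat_count(genotype, allele_repeat_counts):
    """Sum the repeat-unit counts of the alleles in one sample's genotype.

    genotype: tuple of allele indices (0 = REF), with None for a missing allele.
    allele_repeat_counts: repeat-unit count per allele, REF first (assumed layout).
    Returns None if any allele in the genotype is missing.
    """
    if any(allele is None for allele in genotype):
        return None
    return sum(allele_repeat_counts[allele] for allele in genotype)


def format_sample_field(value):
    """Render a value for a VCF sample column, using '.' for missing."""
    return "." if value is None else f"{value:g}"


# Example: REF has 10 repeat units, the two ALT alleles have 12 and 15.
counts = [10, 12, 15]
print(format_sample_field(summed_repeat_count((0, 2), counts)))
print(format_sample_field(summed_repeat_count((None, 1), counts)))
```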

aryarm commented

closing this because it is being handled by CAST-genomics/haptools#73 (comment) instead