Correlate calls from bed files to vcf to determine percent match to archaic genotypes.
All dependencies are specified in the environment.yaml file. A conda environment can be created with
conda create --name bed2vcf --file environment.yaml
Once created, activate the environment with: conda activate bed2vcf
All unit tests can be run within the conda environment by calling pytest
There are two main functions in the top level directory
This function performs filtering of vcfs based on individual names or bed files to retain. The script thin_vcf.slurm shows a submission along with the vcftools implementation. Leveraging zcat and gzip to perform compression and using the sorted nature of both files, this runs more quickly than the vcftools version.
This performs the main analysis and is run with run.slurm. Bed files are
provided with the --bed_files
flag and outputs are written to the
--output_dir
with the same base filename as the inputs, appended with
".matched". Vcfs should be provided for each chromosome and labeled with a
wildcard for python formatting. E.g. if the files are named chrom_1_filter.vcf
the command line expects the input chrom_{chr}_filter.vcf and will expand
to each chromosome in [1, 22]
Currently only bed output is supported. Each row corresponds to a row in the input bed file. The columns are:
- Chromosome
- Start position
- End position
- Number of entries in the modern VCF within this region.
- Number of derived variants within the region for a haplotype
- Number of derived variants in the modern VCF joined with archaic. Intersection of sites, not necessarily matching variants.
- Number of derived variants in the archaic VCF within the region (number of alleles/2)
- Number of derived variants in archaic which match a derived variant in the modern vcf.
MIT © Troy Comi