sgdp-fermi: A repository from lh3

Overview

This repository hosts the unphased FermiKit variant calls for the 263 public SGDP samples across 128 diverse populations. If you use these data, please cite the following paper:

Mallick et al. (2016) The Simons Genome Diversity Project: 300 genomes from 142 diverse populations, Nature.

The data (946MB) can be downloaded through the release page of this repository, or via wget:

wget https://github.com/lh3/sgdp-fermi/releases/download/v1/sgdp-263-hs37d5.tgz

Unpacking the tarball creates the sgdp-263-hs37d5 directory, which contains the following files:

sgdp-263-hs37d5
|-- 263.bgt.spl         # sample names and sample information (text)
|-- 263.bgt.bcf         # site list (in BCFv2)
|-- 263.bgt.pbf         # genotypes (BGT custom binary)
|-- um75-hs37d5.bed.gz  # 75bp universal mask (regions to be filtered out)
|-- um35-hs37d5.bed.gz  # 35bp universal mask
|-- vep-impact.fmf.gz   # VEP annotations excluding MODIFIER impact (gzip'd text)
`-- bgt                 # BGT precompiled binary for x64-linux

In particular, sites outside um75-hs37d5.bed.gz but absent from 263.bgt.bcf should be regarded as homozygous reference. This is important for many population genetics analyses.
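The rule above can be sketched as a small shell function. This is only an illustration, not part of bgt: the function name classify_site and the two-column sites file (chromosome, 1-based position, for example taken from the CHROM and POS columns of an exported VCF) are assumptions. It also handles single positions only and reports a masked position as "masked" even if a variant was called there.

```shell
# classify_site CHROM POS MASK_BED SITES_TSV
#   MASK_BED : uncompressed BED (e.g. from `gzip -dc um75-hs37d5.bed.gz`)
#   SITES_TSV: hypothetical two-column file (chrom, 1-based pos) of called sites
# Prints one of: masked, variant, hom-ref
classify_site() {
    # BED intervals are 0-based and half-open; POS is 1-based as in VCF,
    # so POS lies inside an interval when start < POS <= end.
    if awk -F'\t' -v c="$1" -v p="$2" \
           '$1 == c && p + 0 > $2 + 0 && p + 0 <= $3 + 0 { hit = 1; exit }
            END { exit !hit }' "$3"
    then
        echo masked
    elif awk -F'\t' -v c="$1" -v p="$2" \
           '$1 == c && $2 + 0 == p + 0 { hit = 1; exit }
            END { exit !hit }' "$4"
    then
        echo variant
    else
        # outside the mask and absent from the site list
        echo hom-ref
    fi
}

# Example (file names are placeholders):
#   gzip -dc um75-hs37d5.bed.gz > um75.bed
#   classify_site 11 1500000 um75.bed sites.tsv
```

Scanning the files with awk on every call is slow for many positions; for bulk queries an interval tool working on sorted input would be a better fit.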

The genotypes are stored in the BGT format, which can be exported to VCF by the bgt command-line tool:

# export all variants to VCF
bgt view -C 263.bgt > 263.all.vcf
# exclude sites overlapping filtered regions
bgt view -CeB um75-hs37d5.bed.gz 263.bgt > 263.conf-only.vcf
# common variants only
bgt view -f'AC/AN>.05' 263.bgt
# VCF for one sample in a region (note the comma following -s)
bgt view -r 11:1,000,000-2,000,000 -s,S_French-1 -f'AC>0' 263.bgt
# VCF for two samples
bgt view -s,S_French-1,S_French-2 -f'AC>0' 263.bgt
# VCF for East Asian males (see 263.bgt.spl for annotations)
bgt view -s'region=="EastAsia"&&gender=="M"' -f'AC>0' 263.bgt
# coding variants
bgt view -d vep-impact.fmf.gz -a'cdsPos>0' 263.bgt
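To make the AC/AN expression used above concrete, here is a minimal awk sketch that applies the same threshold to a VCF that has already been exported. bgt's own -f option is the native way to do this; the function name filter_by_af is hypothetical, and the sketch assumes the VCF carries AC and AN keys in its INFO column with a single AC value per record.

```shell
# filter_by_af MIN_AF < in.vcf > out.vcf
# Keep header lines and records whose INFO field gives AC/AN > MIN_AF.
# Assumption: INFO (column 8) contains AC=<int> and AN=<int>.
# If AC holds a comma-separated list, only the first value is used.
filter_by_af() {
    awk -F'\t' -v min="$1" '
        /^#/ { print; next }
        {
            ac = ""; an = ""
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^AC=/) ac = substr(kv[i], 4)
                else if (kv[i] ~ /^AN=/) an = substr(kv[i], 4)
            }
            # skip records without a usable AN to avoid division by zero
            if (an + 0 > 0 && (ac + 0) / (an + 0) > min + 0) print
        }'
}

# Example:
#   filter_by_af 0.05 < 263.all.vcf > 263.common.vcf
```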

Please check out the BGT README for more advanced uses of bgt.

We are also releasing small variants called from human reference genome GRCh38, though this call set lacks universal masks and variant annotations.

Data Processing

Each sample was independently de novo assembled with fermikit-0.8, mapped with bwa-0.7.12 to reference genome hs37d5 and then sorted:

fermi.kit/fermi2.pl unitig -t 8 -p utg -s 3g "fermi.kit/trimadap reads.fq.gz" > utg.mak
make -f utg.mak  # this takes a couple of wall-clock days
fermi.kit/bwa mem -x intractg hs37d5.fa utg.mag.gz | gzip -1 > utg.sam.gz
fermi.kit/htsbox samsort -S utg.sam.gz > utg.srt.bam
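When many samples are processed, the four commands above are repeated with different file names. The following dry-run sketch only prints the commands for one sample so they can be inspected before anything is run. The function name print_sample_cmds and the use of the sample name as the file prefix (in place of the fixed prefix utg) are assumptions, not part of the original pipeline.

```shell
# print_sample_cmds SAMPLE READS
# Print (do not run) the per-sample assembly, mapping and sorting commands,
# using SAMPLE as the file prefix. READS is the input FASTQ file.
print_sample_cmds() {
    printf 'fermi.kit/fermi2.pl unitig -t 8 -p %s -s 3g "fermi.kit/trimadap %s" > %s.mak\n' \
        "$1" "$2" "$1"
    printf 'make -f %s.mak\n' "$1"
    printf 'fermi.kit/bwa mem -x intractg hs37d5.fa %s.mag.gz | gzip -1 > %s.sam.gz\n' \
        "$1" "$1"
    printf 'fermi.kit/htsbox samsort -S %s.sam.gz > %s.srt.bam\n' "$1" "$1"
}

# Example (the FASTQ file name is a placeholder):
#   print_sample_cmds S_French-1 S_French-1.fq.gz
```

Naming the sorted output after the sample keeps each result distinct, so a later glob such as *.srt.bam picks up one file per sample.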

For 100bp reads, fermikit-0.8 should produce very similar results to the latest fermikit-0.12. After mapping, small variants are called from all samples together and filtered:

fermi.kit/htsbox pileup -cuf hs37d5.fa *.srt.bam | bgzip > raw.vcf.gz
fermi.kit/k8 fermi.kit/hapdip.js vcfsum -f raw.vcf.gz | bgzip > flt.vcf.gz

The pileup command line does not apply any statistical modeling. It simply extracts unitig-reference differences and produces a multi-sample VCF. The filtering script marks variants with 1) <50% calling rate or 2) <10 supporting reads in the sample with the highest allele depth. Post-filtered variants are then imported to BGT:

bgt import -S flt.vcf.gz 263.bgt

with sample information added later. Functional annotations were provided by Ensembl Variant Effect Predictor, version 80:

./variant_effect_predictor.pl -i input.vcf -o output.txt --offline --pick \
                              --cache --everything --quiet
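The two filtering criteria described earlier in this section (calling rate below 50%, or fewer than 10 supporting reads in the sample with the highest allele depth) can be illustrated with a short awk sketch. This is not the logic of hapdip.js itself: the function name mark_variants, the four-column input table, the output labels, and the reading of "calling rate" as the fraction of samples with a call are all assumptions made for illustration.

```shell
# mark_variants < table.tsv
# Input columns (hypothetical, tab-separated):
#   ID, samples_called, samples_total, max_allele_depth
# Output: ID and either PASS or a comma-joined list of failed criteria.
mark_variants() {
    awk -F'\t' -v OFS='\t' '
        {
            label = ""
            # criterion 1: fewer than 50% of samples have a call
            if ($3 + 0 == 0 || $2 / $3 < 0.5) label = "low_call_rate"
            # criterion 2: best-supported sample has fewer than 10 reads
            if ($4 + 0 < 10) label = (label == "" ? "" : label ",") "low_depth"
            print $1, (label == "" ? "PASS" : label)
        }'
}

# Example:
#   printf 'v1\t200\t263\t25\n' | mark_variants
```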

Additional Data Produced by FermiKit

Because of space limits, we provide only the genotypes in highly compressed BGT files for download here. We also have a multi-sample VCF with read depth information (15GB compressed), FermiKit unitigs with read depth information (882GB), and the unitig FM-index for 232 samples with low-level microbiome sequences (38GB). Please contact us if you need these data.