This application takes a .bam input in hg19 coordinates and yields a .csv with the variants annotated by SNP name, regions relevant for phylogeny and potential aDNA damages in hg38 coordinates.
- Used technologies
- Description
- Setup (Debian/Ubuntu)
- Running the annotation
- References
- Debian GNU/Linux 11
- Python 3.9.2
- SAMtools 1.16.1
- BCFtools 1.16
Also check the dependencies and the requirements.txt.
Variant calling is performed using BCFtools, keeping the positions with the ancestral genotype. The .csv output is sorted by the Phred quality score.
For the SNP annotation YLeaf and Ybrowse lists of annotated SNPs in hg38 coordinates are used. First we annotate the list of positions called from the sample against the YLeaf list and then we add the Ybrowse annotated positions that are in the sample but not in Yleaf file, giving thus priority to Yleaf annotations.
The change of coordinates is done with pyliftover.
We also annotate the following substitutions prone to aDNA damage due to deamination of cytosines [1]:
- C to T
- G to A
In the process we annotate the regions not suitable for phylogeny as per YSEQ:
- Pseudo Autosomal Region 1 (PAR1): chrY:1..2781479
- Synthetic Assembled Centromeric Region (CEN): chrY:10072350..11686750
- DYZ19 125 bp repeat region: chrY:20054914..20351054 (DYZ19 125 bp repeat region)
- Post Palindromic Region: chrY:26637971..26673210
- Pseudo Autosomal Region 2, (PAR2): chrY:56887903..57217415
The following installation has been tested with Debian GNU/Linux 11. You can try the install_dependencies.sh
script to install the dependencies. This script also allows the installation of SAMtools and BCFtools using the apt package manager.
You may want to check this useful links: link1 and link2.
You may want to check if there are newer SAMtools & BCFtools releases. To install another version just make the relevant replacements in the code below.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install gcc
sudo apt-get install make
sudo apt-get install libncurses5-dev
sudo apt-get install liblzma-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libbz2-dev
sudo apt-get install libcurl4-openssl-dev
cd /usr/bin
sudo wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2
sudo tar -vxjf samtools-1.16.1.tar.bz2
cd samtools-1.16.1
sudo make
sudo make install
export PATH="$PATH:/usr/bin/samtools-1.16.1"
source ~/.profile
cd ..
sudo wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2
sudo tar -vxjf bcftools-1.16.tar.bz2
cd bcftools-1.16
sudo make
sudo make install
export PATH="$PATH:/usr/bin/bcftools-1.16"
source ~/.profile
git clone https://github.com/fj-blanco/bam-to-ychr-calls
cd bam-to-ychr-calls
mkdir ./genomes/
mkdir ./SNP_annotations_hg38/
mkdir ./references/
wget -O ./SNP_annotations_hg38/yleaf_new_positions.txt https://github.com/genid/Yleaf/raw/master/yleaf/data/hg38/new_positions.txt
wget -O ./SNP_annotations_hg38/yleaf_old_positions.txt https://github.com/genid/Yleaf/raw/master/yleaf/data/hg38/old_positions.txt
wget -O ./SNP_annotations_hg38/ybrowse_snps_hg38.csv http://ybrowse.org/gbrowse2/gff/snps_hg38.csv
wget -O ./references/human_g1k_v37.fasta.gz http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gzip -d ./references/human_g1k_v37.fasta.gz
You may want to create a Python virtual environment to install the dependencies, for example with:
python3 -m venv .bam-to-ychr-calls-venv
source .bam-to-ychr-calls-venv/bin/activate
And then install the dependencies:
pip install --no-cache-dir -r requirements.txt
As an example we will use cay0007 from the recent paper A genomic snapshot of demographic and cultural dynamism in Upper Mesopotamia during the Neolithic Transition.
wget -O ./genomes/sample.bam ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR957/ERR9576228/cay007_merged_220118.hs37d5.cons.90perc.bam
wget -O ./genomes/sample.bam.bai ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR957/ERR9576228/cay007_merged_220118.hs37d5.cons.90perc.bam.bai
Note: if the index is not available, it will be created automatically by SAMtools.
The data processing is wrapped in a bash script that calls the necessary processing with SAMtools and BCFtools and then runs the vcf_processing.py
Python script for the annotation.
Give the script execute permissions:
chmod +x bam_to_ychr_calls.sh
The script can be run with as follows:
./bam_to_ychr_calls.sh -s <sample_name> -r <reference_file_path> [options]
where the options are:
-s
(--sample
): Mandatory. Path to the sample .bam file.-r
(--reference
): Mandatory. Path to the reference file.-o
(--output_dir
): Optional. Directory to save the results. Defaults to ./results/.-a
(--all_calls
): Optional. Flag to annotate all calls. If not set, only variants are annotated.-q
(--quality_threshold
): Optional. Quality threshold for filtering on Phred-scaled quality score (QUAL). If not set, no quality filter is applied.
For example, to run the script with the sample and the reference files downloaded above, and to annotate just the variants with QUAL >= 20:
./bam_to_ychr_calls.sh -s ./genomes/sample.bam -r ./references/human_g1k_v37.fasta -q 20
[1] Briggs, A.W., Stenzel, U., Johnson, P.L., Green, R.E., Kelso, J., Prüfer, K., Meyer, M., Krause, J., Ronan, M.T., Lachmann, M. and Pääbo, S., 2007. Patterns of damage in genomic DNA sequences from a Neandertal. Proceedings of the National Academy of Sciences, 104(37), pp.14616-14621.