A repository for annotating, interpreting, reporting and visualizing germline SNPs and small indels in exome data.
-
List of RefSeq GRCh37 transcripts integrated with Ensembl, LRG and clinical transcripts. This file was created with create_grch37.refseq_ensembl_lrg_hugo.R and contains information about:
HGNC_symbol, HGNC_alternative_symbol, refSeq_mRNA, refSeq_protein, refSeq_mRNA_noVersion, refSeq_protein_noVersion, ENSGene, ENSTranscript, LRG_id, clinical_transcript.
-
grch37.clin.manual.refseq_ensembl.txt
List of RefSeq transcripts used in Human Gene Mutation Database (HGMD). This file was manually curated and contains information about:
HGNC_symbol, ENSGene, ENSTranscript, refSeq_mRNA and refSeq_protein.
-
List of genes that changed their names between genome versions GRCh37 and GRCh38. This list was retrieved from Ensembl.
-
Creates an NCBI RefSeq BED file with clinical transcript information. This script accepts RefSeq BED files (hg19) from the UCSC Table Browser, --exons=="path/to/exons.bed" and --introns=="path/to/introns.bed", as well as a BED file with RefSeq, Ensembl, LRG and clinical information, --RefSeq==path/to/RefSeqGRCh37_Ensembl_LRG_clinical.txt. The output file contains information about: Chr, Start (1-based), End, Rank.Exons.Introns, Strand, HGNC_symbol, HGNC_alternative_symbol, ENSGene, ENSTranscript, refSeq_mRNA, refSeq_protein, refSeq_mRNA_noVersion, refSeq_protein_noVersion and LRG_id.
usage:
create_RefSeqBED.R --exons=="path/to/exons.bed" --introns=="path/to/introns.bed" --RefSeq=="path/to/RefSeqGRCh37_Ensembl_LRG_clinical.txt"
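Two of the output columns are easy to illustrate: the 1-based start (BED starts are 0-based) and the accession without its version suffix. The following is a minimal Python sketch, not the code in create_RefSeqBED.R; the function names and the example accession are assumptions made for illustration.

```python
def bed_start_to_one_based(start: int) -> int:
    """BED intervals are 0-based and half-open, so the 1-based start is start + 1."""
    return start + 1


def strip_version(accession: str) -> str:
    """Drop the '.N' version suffix from an accession such as 'NM_000546.5'."""
    return accession.split(".", 1)[0]


if __name__ == "__main__":
    # Hypothetical values, used only to show the two conversions.
    print(bed_start_to_one_based(0))
    print(strip_version("NM_000546.5"))
```

The BED end coordinate needs no change, because a half-open end equals the 1-based inclusive end.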
-
create_grch37.refseq_ensembl_lrg_hugo.R
Creates RefSeqGRCh37_Ensembl_LRG_clinical.txt. This script downloads NCBI RefSeq GRCh37, integrates GRCh37vs38, gets Ensembl gene and transcript IDs from BioMart, gets LRG GRCh37 transcripts, and integrates manually curated clinical transcripts from HGMD. It writes all this information to RefSeqGRCh37_Ensembl_LRG_clinical.txt, and writes a list of genes without any clinical transcript defined and/or a list of genes without RefSeq_mRNA to RefSeqGRCh37_Ensembl_LRG_clinical.checklist.txt.
usage:
create_RefSeqGRCh37_Ensembl_LRG_clinical.R
-
Gets the ClinVar GRCh37 VCF file from ClinVar. By default this script uses preset values for the ClinVar download directory and the vcfanno configuration file directory. Both can be overridden on the command line:
usage:
get_clinvar.R <ClinVar download directory> <vcfanno config file directory>
-
Gets the first and last position of a gene. The output is written to a file sorted by coordinates: Chr, First Position, Last Position, Gene.
usage:
get_genesCoordinates.py [-h] -genes_file REFSEQ -outname OUT_FILE
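The idea of collapsing many exon or intron intervals into one span per gene can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code in get_genesCoordinates.py: the input is assumed to be an iterable of (chrom, start, end, gene) tuples, and `chrom_key` and `gene_spans` are hypothetical names.

```python
def chrom_key(chrom):
    """Sort numeric chromosomes numerically, then the rest (X, Y, MT) by name."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def gene_spans(rows):
    """Return (chrom, first, last, gene) per gene, sorted by coordinates."""
    spans = {}
    for chrom, start, end, gene in rows:
        key = (chrom, gene)
        if key in spans:
            first, last = spans[key]
            spans[key] = (min(first, start), max(last, end))
        else:
            spans[key] = (start, end)
    out = [(chrom, first, last, gene)
           for (chrom, gene), (first, last) in spans.items()]
    return sorted(out, key=lambda r: (chrom_key(r[0]), r[1], r[2]))


if __name__ == "__main__":
    # Hypothetical intervals for three made-up genes.
    rows = [("chr2", 100, 200, "GENE_B"), ("chr1", 500, 600, "GENE_A"),
            ("chr1", 50, 120, "GENE_A"), ("chrX", 5, 9, "GENE_C")]
    for span in gene_spans(rows):
        print(*span, sep="\t")
```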
-
Gets the variants that have HGVS information (cDNA and protein) from the VCF output of VEP (v.95) run on the human genome (hg19). The output is a VCF with: Chr, First Position, Last Position, RefSeq_mRNA, HGVS c., RefSeq_prot, HGVS p.
usage:
get_hgvsnomenclature.py [-h] -vcf VCFFile -outname OUT_FILE
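VEP writes its annotations to the CSQ entry of the INFO column, with the field order declared in the VCF header. The sketch below shows one way to pull the HGVS c. and p. strings out of that entry. It is a hedged Python illustration, not the code in get_hgvsnomenclature.py: the function names are hypothetical, it assumes VEP was run with HGVS output enabled so the `HGVSc` and `HGVSp` fields exist, and it assumes only annotations carrying both values are kept.

```python
def csq_fields(header_line):
    """Read the CSQ field order from the '##INFO=<ID=CSQ,...Format: a|b|c">' line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip().rstrip('">').split("|")


def hgvs_from_info(info, fields):
    """Return (mRNA, c. change, protein, p. change) for each CSQ annotation
    that carries both HGVSc and HGVSp."""
    rows = []
    for item in info.split(";"):
        if not item.startswith("CSQ="):
            continue
        for entry in item[4:].split(","):
            values = dict(zip(fields, entry.split("|")))
            hgvsc = values.get("HGVSc", "")
            hgvsp = values.get("HGVSp", "")
            if hgvsc and hgvsp:
                mrna, _, c_change = hgvsc.partition(":")
                prot, _, p_change = hgvsp.partition(":")
                rows.append((mrna, c_change, prot, p_change))
    return rows
```

Reading the field order from the header, rather than hard-coding positions, keeps the sketch independent of which VEP options were used.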
-
Gets the number of samples in a VCF file that carry a given variant (either homozygous or heterozygous) and the corresponding allele frequency. The output is written to a sorted file containing: Chr, Position, Ref, Alt, Number of samples with variant, Number of homozygous samples, Allele Frequency.
usage:
get_inHouseFreq.py -vcf <multisample VCF file> -outname <output name>
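The three counts reported per variant can be derived from the genotype (GT) strings of a multi-sample VCF record. The following Python sketch is illustrative only and is not taken from get_inHouseFreq.py. It assumes a biallelic site, skips samples with missing genotypes, and defines the allele frequency as alternate alleles divided by all called alleles; the script itself may use a different definition.

```python
def genotype_counts(genotypes):
    """Return (samples with variant, homozygous samples, allele frequency)
    for a list of GT strings such as '0/1', '1|1' or './.'."""
    with_variant = homozygous = alt_alleles = called_alleles = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing genotype: not counted in the denominator
        called_alleles += len(alleles)
        n_alt = sum(allele != "0" for allele in alleles)
        if n_alt:
            with_variant += 1
            alt_alleles += n_alt
            if n_alt == len(alleles):
                homozygous += 1
    freq = alt_alleles / called_alleles if called_alleles else 0.0
    return with_variant, homozygous, freq
```

For example, the genotypes `0/1`, `1/1`, `0/0` and `./.` give two carriers, one homozygous sample, and three alternate alleles out of six called alleles.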
-
Gets MutationTaster prediction scores. The output does not take transcript information into account and reports unique values for each variant. This script accepts an input file with Chr, Position, Ref, Alt.
usage:
get_mutationtaster.R --variants=="path/to/variants_file.txt"
It can be parallelized with:
find <path/to/variants_files*> | xargs -n1 -P 20 -I {} sh -c 'echo {} && ./get_mutationtaster.R --variants=={}'
-
Gets a multi-sample VCF file with vcf-merge, splits the file by chromosome, and calculates allele frequencies with get_inHouseFreq.py. This script runs with default values for every parameter:
Parameter    Default value
-v           current directory
-d           same as -v
-o           inHouse_freq _
-p           24
usage:
get_vcfmerge2freq.sh [-h] [-v <path to VCF files>] [-d <path to output>] [-o <output name>] [-p <number of process to run>]
-
Gets UMD-predictor scores for a list of Ensembl transcript IDs. The output is written to a sorted file containing: Chr, Position, Ref, Alt, HGVS_c, HGVS_p, HGNC_symbol, ENSTranscript, UMD_pred, UMD_score.
usage:
get_UMDpredictor.R [--help] --ENSTranscripts=="path/to/ENSTranscripts.txt"
-
Downloads and runs get_vcfmerge2freq.sh and get_inHouseFreq.py in the current working directory, which is used as the default directory both for reading the sample VCF files and for writing the allele frequency file.
usage:
GENEVA_AlleleFrequency.sh
-
Downloads RefSeq BED files (hg19) from the UCSC Table Browser, exons.bed and introns.bed, as well as RefSeqGRCh37_Ensembl_LRG_clinical.txt created with create_RefSeqGRCh37_Ensembl_LRG_clinical.R. It also downloads and runs create_RefSeqBED.R and get_genesCoordinates.py in the current working directory. This pipeline writes the following files to the RefSeq_annotation/ directory:
- clinical/ (uncomment script to write the same outputs for complete BED)
  - RefSeqGRCh37_clinical_coordinates.txt, file with start and end of each gene in RefSeqGRCh37_clinical_sort.bed.gz, useful for tabix in downstream analyses.
  - RefSeqGRCh37_clinical_coverage.bed, input file to use in coverage analysis.
  - RefSeqGRCh37_clinical_hdr_sort.bed.gz and RefSeqGRCh37_clinical_hdr_sort.bed.gz.tbi, sorted and indexed BED file with clinical transcripts.
usage:
GENEVA_RefSeqBED.sh
-
This script cross-references gene names between the two versions of the human genome (GRCh37 and GRCh38) and writes a file with the genes whose names have changed.
usage:
checkGenesNamesgrch37vsgrch38.py [-h] -refSeq REF_FILE -twoVersions BOTH_FILE -outname OUT_FILE
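The comparison itself can be sketched in a few lines. This Python example is an assumption-laden illustration, not the code in checkGenesNamesgrch37vsgrch38.py: it assumes each genome version has already been loaded into a dictionary from a stable gene identifier to the gene name, and the function name and placeholder identifiers are hypothetical.

```python
def renamed_genes(names_37, names_38):
    """Return (gene_id, GRCh37 name, GRCh38 name) for genes present in both
    versions whose name differs."""
    changed = []
    for gene_id, old_name in sorted(names_37.items()):
        new_name = names_38.get(gene_id)
        if new_name is not None and new_name != old_name:
            changed.append((gene_id, old_name, new_name))
    return changed


if __name__ == "__main__":
    # Placeholder identifiers and names, not real annotation data.
    grch37 = {"GENE_ID_1": "OLD_NAME", "GENE_ID_2": "SAME_NAME"}
    grch38 = {"GENE_ID_1": "NEW_NAME", "GENE_ID_2": "SAME_NAME"}
    for row in renamed_genes(grch37, grch38):
        print(*row, sep="\t")
```

Genes that exist in only one version are ignored here, since no name change can be established for them.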
-
Converts a VCF file into a TSV with the corresponding header. The output file is written in the same directory as the input.
usage:
vcf2table.R --vcf=="path/to/input_file.vcf" [optional: --header=="CHROM, POS, ID, REF, ALT"]
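Because VCF records are already tab-separated, the default conversion amounts to dropping the `##` meta lines and turning the `#CHROM` line into a plain header. The sketch below shows that default case only. It is a hedged Python illustration rather than the code in vcf2table.R, the function names are hypothetical, and the optional --header argument is not modeled.

```python
def vcf_to_table(lines):
    """Split VCF lines into (header columns, data rows), skipping '##' meta lines."""
    header, rows = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")  # '#CHROM' becomes 'CHROM'
            continue
        rows.append(line.split("\t"))
    return header, rows


def to_tsv(header, rows):
    """Join the header and rows back into tab-separated text."""
    return "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"
```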