/atlas-annotations-ensembl

Ensembl annotations for Atlas

Primary language: Shell. License: GNU General Public License v3.0 (GPL-3.0)

Atlas gene annotations from Ensembl

This repository contains scripts that extract gene attributes from large JSON dumps provided by Ensembl for a range of species. Gene attributes extracted from JSON populate the corresponding fields in an output TSV file (a gene-by-annotation table).
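As a rough illustration of the JSON-to-TSV step, the sketch below turns a JSON array of gene objects into a small gene-by-annotation table with jq. The function name `genes_json_to_tsv`, the input layout (a top-level array) and the field names `id`, `name` and `description` are assumptions made for this example only; the real dumps and the jq queries in ./annsrcs may look different.

```shell
#!/usr/bin/env bash
# Convert a JSON array of gene objects into a gene-by-annotation TSV.
# ASSUMPTION: the top-level array layout and the field names (id, name,
# description) are illustrative and may not match the real Ensembl dumps.
genes_json_to_tsv() {
  printf 'ensgene\tsymbol\tdescription\n'
  # Missing attributes become empty cells instead of "null".
  jq -r '.[] | [.id, (.name // ""), (.description // "")] | @tsv' "$1"
}

# Tiny made-up input to demonstrate the call.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[
  {"id": "GENE0001", "name": "abc1", "description": "example gene"},
  {"id": "GENE0002"}
]
EOF
genes_json_to_tsv "$sample"
rm -f "$sample"
```

The `// ""` alternative keeps the number of columns constant when an attribute is absent, which is what makes entirely empty columns detectable later on.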

The final aggregated gene annotation file is loaded into Solr indexes as a bioentities collection. To run the scripts, please make sure the repository's bin directory is on the PATH. The following dependencies need to be installed:

  • awk
  • jq (1.5)
  • bats (for testing)
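A quick way to confirm that the tools listed above are available is sketched below. The helper name `missing_tools` is hypothetical and not part of this repository, and the check only looks for the executables on the PATH; it does not verify the jq version.

```shell
#!/usr/bin/env bash
# Print the name of every tool given as an argument that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report anything missing from the dependency list.
missing=$(missing_tools awk jq bats)
if [ -n "$missing" ]; then
  printf 'Missing dependencies:\n%s\n' "$missing" >&2
fi
```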

Testing the scripts

Run atlas_annotations_ensembl_run_tests.sh to execute the tests.

Extract annotations from JSON dumps to TSV formatted files

Entry points

test/run_annotations_from_ensembl.sh The entry point that triggers the conversion of Ensembl JSON annotations to TSV formatted annotations for all the species defined in ./annsrcs/ensembl. For efficiency, the script submits one bsub job per species to the LSF cluster so that the species are processed in parallel. TODO: include /annsrcs/wbps

Running the entry script test/run_annotations_from_ensembl.sh effectively runs the script bin/annotations_from_ensembl.sh for each species configuration file defined in annsrcs/ensembl.
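The per-species fan-out can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of the actual entry script: the function name `submit_species_jobs`, the log file naming and the way the configuration file is passed to bin/annotations_from_ensembl.sh are all hypothetical. It defaults to a dry run that only prints the bsub commands.

```shell
#!/usr/bin/env bash
# Build one bsub command per species configuration file.
# ASSUMPTIONS: the log file names and the argument passed to
# bin/annotations_from_ensembl.sh are hypothetical; check the real script.
submit_species_jobs() {
  config_dir=$1
  log_dir=$2
  # DRY_RUN defaults to 1, so commands are printed rather than submitted.
  runner=""
  [ "${DRY_RUN:-1}" = 1 ] && runner="echo"
  for config in "$config_dir"/*; do
    [ -f "$config" ] || continue
    species=$(basename "$config")
    $runner bsub \
      -o "$log_dir/$species.out" -e "$log_dir/$species.err" \
      bin/annotations_from_ensembl.sh "$config"
  done
}

# Usage (prints the commands only):
# submit_species_jobs annsrcs/ensembl "$LOG_PATH"
```

Setting `DRY_RUN=0` would submit the jobs for real, which only makes sense on a host where LSF is available.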

Before running the entry point script, set the following environment variables.

## prerequisites - export environment variables
## Note: These paths are used for testing. In production, the output files need to be written to the traditional $ATLAS_PROD/bioentity_properties/annotations
## ENSEMBL_JSON_PATH below is a pilot file path provided by Mark from Ensembl for testing
export ENSEMBL_JSON_PATH=/hps/nobackup2/production/ensembl/ensprod/search_dumps/release-101b/vertebrates/json
export ANNOTATIONS_PATH=/ebi/microarray/home/suhaib/json_Ensembl/annotations
export LOG_PATH=/ebi/microarray/home/suhaib/json_Ensembl/logs
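Because the entry point depends on these variables, it can help to fail early when one of them is missing. The helper below is a minimal sketch; `require_env` is a hypothetical name and is not provided by this repository.

```shell
#!/usr/bin/env bash
# Return non-zero and name the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (portable across shells).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Environment variable $name is not set" >&2
      return 1
    fi
  done
}

# Usage before launching the entry point script:
# require_env ENSEMBL_JSON_PATH ANNOTATIONS_PATH LOG_PATH || exit 1
```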

The truncated test output TSV file for human is in test/homo_sapiens.ensgene.tsv.
For all other species that are defined in ./annsrcs/ensembl, annotations can be found in this directory:

/ebi/microarray/home/suhaib/json_Ensembl/annotations

Make two column TSV gene attributes file used for decoration

Entry points

test/run_merge_gene_attributes.sh This script is the entry point for making two-column gene attribute files for several species. It triggers one LSF job for each species found in ANNOTATIONS_PATH, i.e. the output path of the gene annotations produced by the "Extract annotations from JSON dumps to TSV formatted files" task. In other words, the output files of that task become the input to this one.

Before running the entry point script, set the following environment variables.

## prerequisites - export environment variables
## Note: These paths are used for testing. In production, the output files need to be written to the traditional $ATLAS_PROD/bioentity_properties/ensembl
export ANNOTATIONS_PATH=/ebi/microarray/home/suhaib/json_Ensembl/annotations
export GENE_ATTRIBUTES_PATH=/ebi/microarray/home/suhaib/json_Ensembl/ensembl

Running the entry point script test/run_merge_gene_attributes.sh eventually calls bin/merge_gene_attributes.sh, which takes the desired arguments field1 and field2, the two fields to concatenate, for all the species in OUTPUT_TSV_PATH.

For example, a two-column TSV file with the attributes 'ensgene' (gene_id) and 'symbol' (gene name) is used for decoration (remapping of gene ids to gene names in Atlas production). For human, test/homo_sapiens.ensgene.symbol.tsv shows such a file, generated from the input file homo_sapiens.ensgene.tsv located in OUTPUT_TSV_PATH.
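The idea of picking two attributes out of the wide annotation table can be sketched with awk as below. This is not the code of bin/merge_gene_attributes.sh; the function name `two_column_tsv` is hypothetical, and the sketch assumes that the first line of the input TSV is a header naming each column and that the output carries no header.

```shell
#!/usr/bin/env bash
# Print two columns of a TSV, selected by header name, as a two-column TSV.
# ASSUMPTION: the first line of the input is a header naming each column.
two_column_tsv() {
  awk -F'\t' -v OFS='\t' -v f1="$2" -v f2="$3" '
    NR == 1 {
      # Locate the requested columns in the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == f1) c1 = i
        if ($i == f2) c2 = i
      }
      if (!c1 || !c2) {
        print "field not found in header" > "/dev/stderr"
        exit 1
      }
      next
    }
    { print $c1, $c2 }
  ' "$1"
}

# Usage:
# two_column_tsv homo_sapiens.ensgene.tsv ensgene symbol > homo_sapiens.ensgene.symbol.tsv
```

Selecting columns by name rather than by position keeps the call stable if the column order of the annotation file changes.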

The files for all other species defined in ./annsrcs/ensembl can be found in the directory below:

cd /ebi/microarray/home/suhaib/json_Ensembl/ensembl

Debug for empty columns (gene attributes)

There will be instances where gene attribute values are missing from the Ensembl JSON dumps. The affected species and attributes can be found by running the script below:

export ANNOTATIONS_PATH=/ebi/microarray/home/suhaib/json_Ensembl/annotations
bash test/run_check_empty_columns.sh

Running this script scans all the TSVs column-wise to determine which columns are entirely empty. It does not flag columns in which only some of the values are missing.
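The column-wise scan can be approximated with a few lines of awk, as sketched below. This is an illustration of the technique rather than the contents of test/run_check_empty_columns.sh; the function name `empty_columns` is hypothetical, and the sketch assumes that the first line of each TSV is a header naming the columns.

```shell
#!/usr/bin/env bash
# Print the header name of every column that has no value in any data row.
# ASSUMPTION: the first line of the TSV is a header naming each column.
empty_columns() {
  awk -F'\t' '
    NR == 1 { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    # Remember every column that holds at least one non-empty value.
    { for (i = 1; i <= NF; i++) if ($i != "") seen[i] = 1 }
    END { for (i = 1; i <= ncol; i++) if (!(i in seen)) print name[i] }
  ' "$1"
}

# Usage over all annotation files:
# for f in "$ANNOTATIONS_PATH"/*.tsv; do echo "$f"; empty_columns "$f"; done
```

A column with even a single value is treated as populated, which matches the behaviour described above.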

[WIP] Microarray array designs test queries

export ENSEMBL_JSON_PATH=/hps/nobackup2/production/ensembl/ensprod/search_dumps/release-101b/vertebrates/json
bash test/test_array_design_queries.sh

Structure

./annsrcs

Annotations are declared as jq queries in the species-wise configuration files. These files describe the mapping of Atlas properties into the bioentities collection.

./bin

Executables that extract Atlas annotations in a desired format.

./test

The main run_* scripts, which execute the scripts in ./bin. Also the source of the example output TSV files.