Table of Contents
This is a package that provide tools to extract different genome architecture features, build taxonomy lineage and test features against certain value in taxonomy rank in fully automated pipeline.
The package is very user friendly allowing users to ask question like:
- Does Vertebrates have a significant GC content than other species ?
- What are the genome architecture features that significant in Eukaryotes ?
- What are the features that signifies Betacoronavirus from other Riboviria ?
Output of the genomeAnalytica package is a csv similar to common deferential expression experiments (features with adjusted p-values and averages) and a boxplot figures for significant features.
The package is providing an end to end dedicative pipeline for investigating genomic architecture features which is unique to this package as far as we know.
- Option 1 - Manual
Download and install the latest version of the following.
- BASH
bedtools
entrez-direct - Python
pandas
scipy
statsmodels
matplotlib
seaborn
- Option 2 - Automated
Make conda environment and use the requirements.txt to install needed libraries
$ conda create --name <env_name> --file requirements.txt
$ conda activate <env_name>
- Make data directory and under it make raw directory under it make directory for each species and create a txt file containing 2 ftp links 1 for species genome gtf file and another for species genome fasta file like the following tree:
data
├── raw
├── Caenorhabditis_elegans
│ └── WgetMe.txt
├── Choloepus_hoffmanni
│ └── WgetMe.txt
├── Chrysolophus_pictus
│ └── WgetMe.txt
├── Ciona_intestinalis
│ └── WgetMe.txt
├── Ciona_savignyi
│ └── WgetMe.txt
├── Drosophila_melanogaster
│ └── WgetMe.txt
├── Homo_sapiens
│ └── WgetMe.txt
├── Monopterus_albus
│ └── WgetMe.txt
├── Mus_musculus
│ └── WgetMe.txt
└── Saccharomyces_cerevisiae
└── WgetMe.txt
For using the same species as above just download from data
- Download and unzip the data at each corresponding directory using the following command.
$ ./get_data.sh data/raw WgetMe.txt
- Make features for all species under data/raw. Can be done either using a Fast Mode mode or Expert Mode mode
- Option 1 - Fast Mode:
e.g: creating statistics for (all, protein coding, non protein coding) for (gene, exon, transcript, three_prime_utr, five_prime_utr)
$ ./make_arch_fet.sh data/raw/ g pg npg e pe npe t pt npt f pf npf h ph nph
- Option 2 - Expert Mode:
e.g: creating statistics for protein coding exons
for i in `ls data/raw/`; do echo $i; ./get_len_gc.sh data/raw/$i/ exon -i transcript_biotype=protein_coding ; done
- Make taxonomy lineage for all species under data/raw using the following command
$ ./make_tax.sh data/raw/ tax_mapping.csv
- Make test for calculated feature under data/output using the taxonomy lineage and specifying the 10th rank (supphilum) and (vertebrata) as the value to compare against using the following command:
$ ./make_arch_test.py --stats data/output/ --tax data/tax_mapping.csv --tax_value vertebrata
The package consists of three main commands to build standard features, build mapping and test them supplementary commands are available to batch download ftp links and build custom genome architecture features.
- Description:
This script loop throw the species directories under user specified species containing directory; reading the user specified text file containing the ftp links for gtf and fasta then unzip them - Positional Arguments
There are 2 mandatory arguments
arg1: the directory containing the species directories
arg2: the name of the files that contains the ftp link to download - Example:
$ ./get_data.sh data/raw WgetMe.txt
- Description
This module gets stats (avg gc content, avg length and count ) for a custom user specified feature and filtration in a gtf file. Features extracted can differ from those served out of the box in make_arch_fet. - Positional Arguments
There are 2 mandatory arguments :
arg1: the path to directory containing the species gtf and fasta files (uncompressed) with the following hirarchy: raw/species/ e.g (data/raw/Homo_Sapiens)
arg2: the feature that will be filtered for stats. this feature should exist in column 3 gtf file. e.g (exon, gene, transcript)
There are 2 optional arguments to be used when you want to filter certain tag in column 9 gtf:
arg3: a flag to include or exclude the following (tag=value) argument. to include use -i , to exclude use -e
arg4: the name of the tag you want to filter tigetherr with the filtering value in the form of (tag=value). e.g (-i gene_biotype=protein_coding) - Example:
- to calculate the stats of all genes of Homo Sapiens use the following
$ bash get_len_gc.sh dat/raw/Homo_Sapiens gene
- to calculate the stats of exons coming from non-protein coding transcript of Homo Sapiens use the following
$ bash get_len_gc.sh dat/raw/Homo_Sapiens exon -e transcript_biotype=proteing_coding
- Description
This script generalize applying stats like (avg gc content, avg length, count ) for all species under user input containing dir given that each species dir contains gtf and fasta file for species genome - Positional Arguments
There are minimum 2 mandatory arguments :
arg1: the path to directory containing all the species directories containing gtf and fasta files (uncompressed). Dir must be named raw e.g (data/raw/)
arg2,3,4,...: 1 to 3 characters representing the features that will be filtered for stats. with the following naming:
filtration criteria: can be:
p for pretein coding or
np for non proteing coding
feature type: can be:
g for gene
e for exon
t for transcript
f for five_prime_utr
h for three_prime_utr
e.g pg for protein coding gene; npt for non protein coding three_prime_utr - Example:
to calculate the stats of (non protein coding genes, protein coding exons and all five_prime_utr) for all species under data/raw/ use the following
$ ./ make_arch_fet.sh data/raw/ npg pe f
To calculate all features
$ ./make_arch_fet.sh data/raw/ g pg npg e pe npe t pt npt f pf npf
- Description
This script loop throw the species directories under user specified species contining directory; using the species dir naming the taxonomy lineage record is retrieved from ncbi, appended in csv file and saved in user specified file - Positional Arguments
There are 2 mandatory arguments
arg1: the directory containing the species directories
arg2: the name of the taxonomy mapping csv file to be saved - Example:
$ ./make_tax.sh data/raw/ tax_mapping.csv
- Description
This script takes the path for data created by make_arch_fet, path for species lineage csv created by make_tax, rank value to test againist. Count the nulls (which is filled with 0 and returned as 0_count) in each feature, applies 2 sample 2 tailed ttest and Manwhetney, correct using benferroni and returns
i. parsed data architicture feature in 1 csv
ii. test results containing (feature,0_count,avg_[rank_value],avg_non_[rank_value],MannWhitney_p,ttest_p,MannWhitney_adj_p,ttest_adj_p,MannWhitney_adj_reject,ttest_adj_reject)
iii. filtered parsed data architicture features for only significant ones according to corrected p-value < 0.05 - Named Arguments:
-h, --help: show this help message and exit
--stats [STATS]: Directory containning the nested csv stats for all species in form of (species, stats)
--tax [TAX]: File csv contianing the full lineage of all species in stats
--out_test [OUT_TEST]: File to write the test results
--out_parsed [OUT_PARSED]: File to write parsed data
--out_parsed_filtered [OUT_PARSED_FILTERED]: File to write parsed data with filtered significant features
--out_fig [OUT_FIG] png File to write boxplot figure
--tax_value [RANK_VALUE]: value to test for in the tax rank e.g: Vertebrata to test Vert vs In-Vert in Subphylum - Example:
$ ./make_arch_test.py --stats data/output/ --tax data/tax_mapping.csv --tax_value vertebrata
- test results
feature | 0_count | avg_vertebrata | avg_non_vertebrata | MannWhitney_p | ttest_p | MannWhitney_adj_p | ttest_adj_p | MannWhitney_adj_reject | ttest_adj_reject |
---|---|---|---|---|---|---|---|---|---|
gene_-i_gene_biotype=protein_coding_count | 0 | 17616.4 | 12821.4 | 0.000145036 | 4.61E-08 | 0.005221295 | 1.66E-06 | TRUE | TRUE |
exon_-e_transcript_biotype=protein_coding_avg_len | 0 | 247.5974 | 382.6778 | 0.000145036 | 3.91E-05 | 0.005221295 | 0.001405841 | TRUE | TRUE |
- siginificant features boxplot
Ahmed Tarek
Abdullah Alkhawaja
Ali El-Nisr
Montaser Bellah Yasser