GOThresher removes annotation bias from GAF files based on GO term information content, GO evidence, annotation source, number of proteins annotated from a given source, and date. GOThresher accepts one or more GAF files as input. The motivation for GOThresher lies in the observation that protein function annotations are biased by high-throughput experimental studies (1). Removing such annotation biases can help present a more balanced picture of protein annotations for a given organism or set of proteins.
GOThresher requires Python 3.5 or newer with the following libraries installed:
Modules can be installed automatically using pip, or obtained from their respective websites.
If GOThresher is installed using conda, none of the above prerequisites need to be installed separately.
GOThresher requires an OBO-formatted version of the Gene Ontology. Depending on your needs, this would usually be either go-basic.obo or go.obo. For more details, and to download either the most recent daily version or the latest version, go to the Gene Ontology website.
Additionally, a config file that defines the parameters used to generate the mapping files is required. This is a .ini file, which can be downloaded from here. If you clone this repository in the following installation step, the .ini and ExampleData/go.obo files will be included.
There are three different ways to install GOThresher:
1. Installation using pip:
GOThresher is available on PyPI and can be installed using pip:
$ pip install gothresher
Note that this approach will not download the data and config files that are available in the GitHub repository. You will have to either clone the repository separately or download the example data and the .ini config file required to run GOThresher from Figshare.
2. Installation using conda:
GOThresher can be installed from Conda. Please note that you will have to download the config and .obo files required to run GOThresher separately, either by cloning this GitHub repository or from Figshare.
It is recommended to create a separate Conda environment and install GOThresher into it. This keeps the correct versions of all dependencies isolated from the system's.
$ conda create --name gth
Activate the environment:
$ conda activate gth
Install GOThresher in the isolated conda environment by running:
$ conda install -c bioconda gothresher
3. Installation from source: Alternatively, it is possible to manually download from GitHub or clone the repository using the following command:
$ git clone https://github.com/FriedbergLab/GOThresher
$ cd GOThresher
and install GOThresher by running:
$ pip install .
After installation, follow the steps described below for detailed instructions on how to use GOThresher. Alternatively, you can jump to the example usage.
GOThresher requires graphs of the three ontologies (MF, CC, BP), a mapping of each GO term to all of its ancestors, and a mapping of alternate GO IDs to actual GO IDs. These files can be generated by running gothresher_prep.
gothresher_prep requires a .ini file and a .obo file. These files can be downloaded into the working directory by cloning this GitHub repository, or from Figshare. Once these files are in the working directory, run the following command:
Run this command only once to generate the mapping files
$ gothresher_prep -c gothresher.ini -i ExampleData/go.obo
gothresher.ini is a file that defines the parameters used to generate the mapping files:
- onto_dir: Name of the output directory where the mapping files will be saved. The default is onto, but this can be changed as per user preference.
- root_bpo: GO ID for the root term of the BPO graph
- root_cco: GO ID for the root term of the CCO graph
- root_mfo: GO ID for the root term of the MFO graph
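To illustrate what a config file with these four parameters might look like, the sketch below parses an in-memory .ini snippet with Python's configparser. The section name [ontology] and the overall layout are assumptions made for this example and may differ from the gothresher.ini shipped with the repository; the three root IDs are the standard Gene Ontology root terms.

```python
import configparser

# Hypothetical config text: the section name "ontology" is an assumption,
# not necessarily what the shipped gothresher.ini uses.
EXAMPLE_INI = """
[ontology]
onto_dir = onto
root_bpo = GO:0008150
root_cco = GO:0005575
root_mfo = GO:0003674
"""

PARAMETERS = ("onto_dir", "root_bpo", "root_cco", "root_mfo")


def read_prep_settings(ini_text):
    """Return the four documented parameters as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["ontology"]
    return {key: section[key] for key in PARAMETERS}


settings = read_prep_settings(EXAMPLE_INI)
print(settings["onto_dir"], settings["root_mfo"])
```

Reading the file this way is only a convenient check that the parameters are present before running gothresher_prep; it does not reproduce what gothresher_prep does internally.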
go.obo is the ontology file supplied within the ExampleData directory.
Users can choose to download the go.obo or go-basic.obo file of their choice from http://www.geneontology.org/ontology/ instead of using the provided go.obo file. Make sure to provide the appropriate path to your .obo file when running gothresher_prep.
gothresher_prep will generate seven files in total:
- Three files corresponding to the three ontologies
- Three files corresponding to the mapping between each GO term and its ancestors in its own respective ontology
- One file containing mapping from alternate GO_ID to actual GO_ID.
IMPORTANT: This command needs to be run again whenever a new version of the ontology is available and updated graph and mapping files are needed for analysis. In that case, run gothresher_prep after downloading the new go.obo file.
The following files will be generated within the user-specified <onto_dir> folder:
1. ./<onto_dir>/alt_to_id.graph : Needed to obtain mapping from alternate GO_ID to actual GO_ID
2. ./<onto_dir>/mf.graph : The MFO Ontology graph
3. ./<onto_dir>/bp.graph : The BPO Ontology graph
4. ./<onto_dir>/cc.graph : The CCO Ontology graph
5. ./<onto_dir>/mf_ancestors.map : The MFO Ancestors map
6. ./<onto_dir>/bp_ancestors.map : The BPO Ancestors map
7. ./<onto_dir>/cc_ancestors.map : The CCO Ancestors map
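Before running gothresher, it can be useful to confirm that all seven files listed above are present. The following is a minimal sketch of such a check; the helper name missing_mapping_files is hypothetical and is not part of GOThresher.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "alt_to_id.graph",
    "mf.graph",
    "bp.graph",
    "cc.graph",
    "mf_ancestors.map",
    "bp_ancestors.map",
    "cc_ancestors.map",
]


def missing_mapping_files(onto_dir):
    """Return the expected file names that are absent from onto_dir."""
    base = Path(onto_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


# "onto" is the documented default output directory.
missing = missing_mapping_files("onto")
if missing:
    print("Missing files, rerun gothresher_prep:", ", ".join(missing))
else:
    print("All mapping files found.")
```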
- Download the latest go.obo or go-basic.obo file from http://www.geneontology.org/ontology/
- Run the program gothresher_prep, providing the downloaded obo file as well as the config file included in this repository. See the usage details here. This program needs to be run only when a new obo file needs to be used.
- Run the program gothresher
A glossary and expanded explanations of some of the command line options are available in the Manual.
usage: gothresher [-h] [--prefix PREFIX] [--cutoff_prot CUTOFF_PROT]
[--cutoff_attn CUTOFF_ATTN] [--output OUTPUT]
[--evidence EVIDENCE [EVIDENCE ...] | --evidence_inverse
EVIDENCE_INVERSE [EVIDENCE_INVERSE ...]] --input INPUT
[INPUT ...] [--aspect ASPECT [ASPECT ...]]
[--assigned_by ASSIGNED_BY [ASSIGNED_BY ...] |
--assigned_by_inverse ASSIGNED_BY_INVERSE
[ASSIGNED_BY_INVERSE ...]] [--recalculate RECALCULATE]
[--info_threshold_Wyatt_Clark_percentile INFO_THRESHOLD_WYATT_CLARK_PERCENTILE | --info_threshold_Wyatt_Clark INFO_THRESHOLD_WYATT_CLARK]
[--info_threshold_Phillip_Lord_percentile INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE | --info_threshold_Phillip_Lord INFO_THRESHOLD_PHILLIP_LORD]
[--verbose VERBOSE] [--date_before DATE_BEFORE]
[--date_after DATE_AFTER] [--single_file SINGLE_FILE]
[--select_references SELECT_REFERENCES [SELECT_REFERENCES ...]
| --select_references_inverse SELECT_REFERENCES_INVERSE
[SELECT_REFERENCES_INVERSE ...]] [--report REPORT]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX, -pref PREFIX
Add a prefix to the name of your output files.
--cutoff_prot CUTOFF_PROT, -cprot CUTOFF_PROT
The threshold level for deciding to eliminate
annotations which come from references that annotate
more than the given 'threshold' number of PROTEINS
--cutoff_attn CUTOFF_ATTN, -cattn CUTOFF_ATTN
The threshold level for deciding to eliminate
annotations which come from references that annotate
more than the given 'threshold' number of ANNOTATIONS
--output OUTPUT, -odir OUTPUT
Writes the final outputs to the directory in this
path.
--evidence EVIDENCE [EVIDENCE ...], -e EVIDENCE [EVIDENCE ...]
Accepts Standard Evidence Codes outlined in
http://geneontology.org/page/guide-go-evidence-codes.
All 3 letter code for each standard evidence is
acceptable. In addition to that, EXPEC is accepted
which will pull out all annotations which are made
experimentally. COMPEC will extract all annotations
which have been done computationally. Similarly,
AUTHEC and CUREC are also accepted. Cannot be provided
if -einv is provided
--evidence_inverse EVIDENCE_INVERSE [EVIDENCE_INVERSE ...], -einv EVIDENCE_INVERSE [EVIDENCE_INVERSE ...]
Leaves out the provided Evidence Codes. Cannot be
provided if -e is provided
--aspect ASPECT [ASPECT ...], -a ASPECT [ASPECT ...]
Enter P, C or F for Biological Process, Cellular
Component or Molecular Function respectively
--assigned_by ASSIGNED_BY [ASSIGNED_BY ...], -assgn ASSIGNED_BY [ASSIGNED_BY ...]
Choose only those annotations which have been
annotated by the provided list of databases. Cannot be
provided if -assgninv is provided
--assigned_by_inverse ASSIGNED_BY_INVERSE [ASSIGNED_BY_INVERSE ...], -assgninv ASSIGNED_BY_INVERSE [ASSIGNED_BY_INVERSE ...]
Choose only those annotations which have NOT been
annotated by the provided list of databases. Cannot be
provided if -assgn is provided
--recalculate RECALCULATE, -recal RECALCULATE
Set this to 1 if you wish to enforce the recalculation
of the Information Accretion for every GO term.
Calculation of the information accretion is time
consuming. Therefore keep it to zero if you are
performing rerun on old data. The program will then
read the information accretion values from a file
which it wrote to in the previous run of the program
--info_threshold_Wyatt_Clark_percentile INFO_THRESHOLD_WYATT_CLARK_PERCENTILE, -WCTHRESHp INFO_THRESHOLD_WYATT_CLARK_PERCENTILE
Provide the percentile p. All annotations having
information content below p will be discarded
--info_threshold_Wyatt_Clark INFO_THRESHOLD_WYATT_CLARK, -WCTHRESH INFO_THRESHOLD_WYATT_CLARK
Provide a threshold value t. All annotations having
information content below t will be discarded
--info_threshold_Phillip_Lord_percentile INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE, -PLTHRESHp INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE
Provide the percentile p. All annotations having
information content below p will be discarded. So if 5 is provided, proteins annotated by
terms whose score is in the bottom 5% will be discarded.
--info_threshold_Phillip_Lord INFO_THRESHOLD_PHILLIP_LORD, -PLTHRESH INFO_THRESHOLD_PHILLIP_LORD
Provide a value t. All annotations having
information content below t will be discarded
--verbose VERBOSE, -v VERBOSE
Set this argument to 1 if you wish to view the outcome
of each operation on the console
--date_before DATE_BEFORE, -dbfr DATE_BEFORE
The date entered here will be parsed by the parser
from dateutil package. For more information on
acceptable date formats please visit
https://github.com/dateutil/dateutil/. All annotations
made prior to this date will be selected
--date_after DATE_AFTER, -daftr DATE_AFTER
The date entered here will be parsed by the parser
from dateutil package. For more information on
acceptable date formats please visit
https://github.com/dateutil/dateutil/. All annotations
made after this date will be selected
--single_file SINGLE_FILE, -single SINGLE_FILE
Set to 1 in order to output the results of all provided inputs into a single output file.
--select_references SELECT_REFERENCES [SELECT_REFERENCES ...], -selref SELECT_REFERENCES [SELECT_REFERENCES ...]
Provide the paths to files which contain references
you wish to select. It is possible to include
references in case you wish to select annotations made
by a few references. This will prompt the program to
interpret string which have the keywords
'GO_REF','PMID' and 'Reactome' as a GO reference.
Strings which do not contain that keyword will be
interpreted as a file path which the program will
except to contain a list of GO references. The program
will accept a mixture of GO_REF and file names. It is
also possible to choose all references of a particular
category and a handful of references from another. For
example if you wish to choose all PMID references,
just put PMID. The program will then select all PMID
references. Currently the program can accept PMID,
GO_REF and Reactome
--select_references_inverse SELECT_REFERENCES_INVERSE [SELECT_REFERENCES_INVERSE ...], -selrefinv SELECT_REFERENCES_INVERSE [SELECT_REFERENCES_INVERSE ...]
Works like -selref but does not select the references
which have been provided as input
--report REPORT, -r REPORT
Provide the path where the report file will be stored.
If you are providing a path please make sure your path
ends with a '/'. Otherwise the program will assume the
last string after the final '/' as the name of the
report file. A single report file will be generated.
Information for each species will be put into
individual worksheets.
Required arguments:
--input INPUT [INPUT ...], -i INPUT [INPUT ...]
The input file path. Please remember the name of the
file must start with goa in front of it, with the name
of the species following separated by an underscore
NOTE: Files inside the folder temp are generated when -recal is set to 1.
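The --input help text states that input file names must start with goa, followed by an underscore and the species name. The sketch below shows one way to validate a file name against that pattern and extract the species part. How GOThresher itself parses the name is not shown here, so the function species_from_gaf_name is a hypothetical helper and its behavior is an assumption.

```python
from pathlib import Path


def species_from_gaf_name(path):
    """Return the species part of a 'goa_<species>.<ext>' file name.

    Raises ValueError when the name does not follow the documented pattern.
    """
    stem = Path(path).name.split(".")[0]
    prefix, sep, species = stem.partition("_")
    if prefix != "goa" or not sep or not species:
        raise ValueError(f"expected a name like goa_<species>.gaf, got {path!r}")
    return species


print(species_from_gaf_name("ExampleData/goa_exampleYeast.gaf"))
```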
$ gothresher_prep -i ExampleData/go.obo -c gothresher.ini
This command will generate seven files in total: three files corresponding to the three ontologies, three files mapping each GO term to its ancestors within its own ontology, and one file containing the mapping from alternate GO IDs to actual GO IDs. Please rerun this command every time you update the GO .obo file.
$ gothresher -cprot 100 -i ExampleData/goa_exampleYeast.gaf ExampleData/goa_exampleDicty.gaf -a C -WCTHRESHp 2 -recal 1
This command reads two input files, one for yeast and the other for dicty. The -a C argument selects only the "CCO" annotations. The -WCTHRESHp argument sets the Wyatt Clark threshold to the 2nd percentile, which means all annotations with a Wyatt Clark information content in the bottom 2% will be removed. Instead of a percentile, you can also provide an absolute threshold value using the argument -WCTHRESH. In addition, annotations that come from references annotating more than 100 proteins will be removed. The output will be saved in the current directory. It is necessary to include -recal 1 in this command since the GO term to IC mapping has not been created yet. Subsequent runs on the same data with different threshold values are possible without the -recal argument; for new data files, however, use -recal 1.
This command will generate 3 output files: one file for each of the two organisms, and a third file in which the annotations for both organisms are combined.
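The idea behind -cprot can be sketched in a few lines of Python: count how many distinct proteins each reference annotates, then drop every annotation whose reference exceeds the cutoff. This is an illustration of the filtering principle only, not GOThresher's implementation. The tuple layout and the PMIDs are made up for the example, and a GAF reference column that lists several references is treated as a single string here.

```python
from collections import defaultdict


def drop_high_throughput(annotations, cutoff_prot):
    """Drop annotations whose reference annotates more than cutoff_prot proteins.

    annotations: iterable of (protein_id, go_id, reference) tuples.
    """
    annotations = list(annotations)
    proteins_per_ref = defaultdict(set)
    for protein, _go_id, reference in annotations:
        proteins_per_ref[reference].add(protein)
    return [
        row for row in annotations
        if len(proteins_per_ref[row[2]]) <= cutoff_prot
    ]


# Hypothetical annotations: PMID:1 annotates three proteins, PMID:2 only one.
rows = [
    ("P1", "GO:0005634", "PMID:1"),
    ("P2", "GO:0005634", "PMID:1"),
    ("P3", "GO:0005737", "PMID:1"),
    ("P1", "GO:0005737", "PMID:2"),
]
kept = drop_high_throughput(rows, cutoff_prot=2)
print(kept)
```

With a cutoff of 2, only the annotation from PMID:2 survives, because PMID:1 annotates three distinct proteins.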
$ gothresher -i ExampleData/goa_exampleYeast.gaf ExampleData/goa_exampleDicty.gaf -a C P -PLTHRESHp 30 -e EXPEC IBA -odir ExampleData/output -single 1
This command will read two input files and select "CCO" and "BPO" annotations. Further, it will keep only those annotations which have been made experimentally ("EXPEC") or have been annotated computationally as "IBA" (Inferred from Biological aspect of Ancestor). In addition, it will discard all annotations with a Phillip Lord information content in the bottom 30%. Instead of a percentile, you can also provide an absolute threshold value using the argument -PLTHRESH. The final output will be generated inside the ExampleData/output directory. The path may include directories that do not exist yet; the program will attempt to create them if the required permissions are present. Since the -single 1 argument has been provided, only one output file will be produced, containing all the selected annotations from both organisms.
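To make the percentile threshold more concrete, here is a rough sketch under stated assumptions. It uses the common frequency-based definition of information content, IC(t) = -log2(count(t) / count(root)), and a nearest-rank percentile. Whether GOThresher computes the Phillip Lord score or the percentile in exactly this way is not stated in this document, and the term IDs and counts below are placeholders.

```python
import math


def information_content(term_counts, root_count):
    """IC(t) = -log2(count(t) / root_count); counts include descendant terms."""
    return {t: -math.log2(c / root_count) for t, c in term_counts.items()}


def percentile_cutoff(values, p):
    """Nearest-rank p-th percentile of values, for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Placeholder term IDs and annotation counts, for illustration only.
counts = {"GO:A": 100, "GO:B": 50, "GO:C": 25, "GO:D": 5}
ic = information_content(counts, root_count=100)
cutoff = percentile_cutoff(ic.values(), 30)

# Terms with information content below the cutoff are discarded.
kept = sorted(term for term, value in ic.items() if value >= cutoff)
print(cutoff, kept)
```

In this toy data the most frequent term has the lowest information content and is the only one that falls below the 30th-percentile cutoff.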
$ gothresher -cattn 1000 -i ExampleData/goa_exampleYeast.gaf ExampleData/goa_exampleDicty.gaf -a C P -einv COMPEC -pref testing -selrefinv Reactome
This command will read two input files and select "CCO" and "BPO" annotations. Further, it will discard those annotations which have been made computationally, and filter out all annotations whose reference is a "Reactome" reference. All output files will be prefixed with the string "testing"; the program generates a descriptive name for each output file, and the prefix is added in front of it.
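The group aliases accepted by -e and -einv (EXPEC, COMPEC, AUTHEC, CUREC) can be thought of as shorthand for sets of three-letter evidence codes. The sketch below shows how such aliases might be expanded and applied. The membership of each group is an assumption based on the GO Consortium's evidence code categories and may not match the sets GOThresher uses, so consult the Manual for the authoritative lists.

```python
# Assumed group membership, following the GO Consortium evidence categories.
# These sets are NOT taken from GOThresher's source code.
EVIDENCE_GROUPS = {
    "EXPEC": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "COMPEC": {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"},
    "AUTHEC": {"TAS", "NAS"},
    "CUREC": {"IC", "ND"},
}


def expand_codes(args):
    """Expand group aliases into individual evidence codes."""
    codes = set()
    for arg in args:
        codes |= EVIDENCE_GROUPS.get(arg, {arg})
    return codes


def filter_by_evidence(rows, codes, inverse=False):
    """Keep rows whose evidence code is in codes, or not in codes if inverse.

    rows: iterable of (protein_id, go_id, evidence_code) tuples.
    """
    wanted = expand_codes(codes)
    return [row for row in rows if (row[2] in wanted) != inverse]


# Hypothetical annotations for illustration.
rows = [
    ("P1", "GO:0005634", "IDA"),
    ("P2", "GO:0005634", "ISS"),
    ("P3", "GO:0005737", "IBA"),
]
print(filter_by_evidence(rows, ["EXPEC", "IBA"]))        # like -e EXPEC IBA
print(filter_by_evidence(rows, ["COMPEC"], inverse=True))  # like -einv COMPEC
```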
Unit tests are provided inside the tests directory. Please note that if GOThresher has been installed directly from PyPI using pip, or using conda, you will have to download the files needed to run the test script separately.
unittest is part of the Python standard library, and therefore comes packaged with Python.
Download the entire GOThresher repository (Recommended):
$ git clone https://github.com/FriedbergLab/GOThresher
$ cd GOThresher/tests
OR
Download the tests directory only:
$ svn export https://github.com/FriedbergLab/GOThresher.git/trunk/tests
$ cd tests
Run the tests from within the tests directory:
$ python test_gothresher.py
Expected output:
OK