GOThresher: a program to remove annotation biases from protein function annotation datasets

GOThresher removes annotation bias from GAF files based on GO term information content, GO evidence, annotation source, number of proteins annotated from a given source, and date. GOThresher accepts one or more GAF files as input. The motivation for GOThresher lies in the observation that protein function annotations are biased due to high throughput experimental studies (1). Removing such annotation biases can help present a more balanced picture of protein annotations for a given organism or set of proteins.

Prerequisites

Required modules:

GOThresher requires Python 3.5 or newer with the following libraries installed:

Modules can be automatically installed using pip, or obtained from their respective websites.

If GOThresher is installed using conda, none of the above pre-requisites are needed.

Required files:

GOThresher requires an obo formatted version of the Gene Ontology. Depending on your needs, this would usually be one of go-basic.obo or go.obo. For more details and to download either the most recent daily version or the latest version go to the Gene Ontology website.

Additionally, a config file (.ini) that defines the parameters used to generate the mapping files is required. It is available from this repository. If you clone this repository in the installation step below, the .ini and ExampleData/go.obo files are included.

Installation

There are three different ways to install GOThresher:

1. Installation using pip:

GOThresher is available on PyPI and can be installed using pip:

$ pip install gothresher

Note that this approach does not download the data and config files that are available in the GitHub repository. You will have to clone the repository separately, or download the example data and the .ini config file required to run GOThresher from Figshare.

2. Installation using conda:

GOThresher can be installed from Conda. Note that you will have to obtain the config and .obo files required to run GOThresher separately, either by cloning this GitHub repository or from Figshare.

It is recommended to create a separate Conda environment and install GOThresher into it. This allows having the correct version of all the dependencies isolated from the system's.

$ conda create --name gth

Activate the environment:

$ conda activate gth

Install GOThresher in the isolated conda environment by running:

$ conda install -c bioconda gothresher

3. Installation from source:

Alternatively, you can download GOThresher manually from GitHub, or clone the repository using the following commands:

$ git clone https://github.com/FriedbergLab/GOThresher
$ cd GOThresher

and install GOThresher by running:

$ pip install .

After installation, follow the steps below for detailed instructions on how to use GOThresher. Alternatively, you can jump to the example usage.

Generate initial mapping files

GOThresher requires graphs of the three ontologies (MF, CC, BP), a mapping of each GO term to all of its ancestors, and a mapping of alternate GO IDs to actual GO IDs. These files can be generated by running gothresher_prep.

gothresher_prep requires a .ini file and a .obo file. These files can be obtained by cloning this GitHub repository, or from Figshare. Once they are in the working directory, run the following command. It needs to be run only once per ontology version to generate the mapping files:

$ gothresher_prep -c gothresher.ini -i ExampleData/go.obo

Config file:

gothresher.ini is a file that defines the parameters used to generate the mapping files.

onto_dir: Name of the output directory where the mapping files will be saved. The default is onto, but it can be changed as needed.
root_bpo: GO ID for the root term of BPO graph
root_cco: GO ID for the root term of CCO graph
root_mfo: GO ID for the root term of MFO graph
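
As a rough illustration of the four parameters above, the sketch below parses an ini text with Python's configparser. The section name `paths_and_roots` is an assumption made for this example and may differ from the gothresher.ini shipped with the repository; the three GO IDs are the standard root terms of the Gene Ontology (biological_process, cellular_component, molecular_function).

```python
import configparser

# Hypothetical layout: the section name "paths_and_roots" is an assumption.
# Check the gothresher.ini shipped with the repository for the real one.
EXAMPLE_INI = """
[paths_and_roots]
onto_dir = onto
root_bpo = GO:0008150
root_cco = GO:0005575
root_mfo = GO:0003674
"""

def read_settings(text):
    """Parse ini text and return the four documented parameters as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Use the first section found, whatever it is called.
    section = parser[parser.sections()[0]]
    keys = ("onto_dir", "root_bpo", "root_cco", "root_mfo")
    return {key: section[key] for key in keys}

settings = read_settings(EXAMPLE_INI)
for key, value in settings.items():
    print(key, "=", value)
```

Reading the first section rather than a hard-coded name keeps the sketch usable even if the real file names its section differently.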

Ontology:

go.obo is the ontology file supplied within the ExampleData directory.

Users can choose to download the go.obo or go-basic.obo file of their choice from http://www.geneontology.org/ontology/ instead of using our provided go.obo file. Make sure to provide the appropriate path to your .obo file while running gothresher_prep.

gothresher_prep will generate seven files in total:

Three files corresponding to the three ontologies
Three files corresponding to the mapping between each GO term and its ancestors in its own respective ontology
One file containing mapping from alternate GO_ID to actual GO_ID.

IMPORTANT: This command needs to be run again whenever a new version of the ontology is available and updated graphs and mapping files need to be used for analysis. In that case, run gothresher_prep after downloading the new go.obo file.

The following files will be generated within the user-specified <onto_dir> folder:

1. ./<onto_dir>/alt_to_id.graph : Needed to obtain mapping from alternate GO_ID to actual GO_ID
2. ./<onto_dir>/mf.graph : The MFO Ontology graph
3. ./<onto_dir>/bp.graph : The BPO Ontology graph
4. ./<onto_dir>/cc.graph : The CCO Ontology graph
5. ./<onto_dir>/mf_ancestors.map : The MFO Ancestors map
6. ./<onto_dir>/bp_ancestors.map : The BPO Ancestors map
7. ./<onto_dir>/cc_ancestors.map : The CCO Ancestors map
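
A quick way to confirm that gothresher_prep produced everything is to check for the seven file names listed above. The following is a minimal sketch, not part of GOThresher; the helper name `missing_mapping_files` is made up for this example, and `onto` is the default onto_dir mentioned in the config section.

```python
from pathlib import Path

# The seven files listed above, relative to <onto_dir>.
EXPECTED_FILES = (
    "alt_to_id.graph",
    "mf.graph",
    "bp.graph",
    "cc.graph",
    "mf_ancestors.map",
    "bp_ancestors.map",
    "cc_ancestors.map",
)

def missing_mapping_files(onto_dir):
    """Return the expected file names that are not present in onto_dir."""
    base = Path(onto_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]

missing = missing_mapping_files("onto")
if missing:
    print("Missing files, consider re-running gothresher_prep:", ", ".join(missing))
else:
    print("All seven mapping files are present.")
```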

Quick setup

Download the latest go.obo or go-basic.obo file from http://www.geneontology.org/ontology/.
Run gothresher_prep, providing the downloaded obo file and the config file included in this repository (see "Generate initial mapping files" above). This program needs to be run only when a new obo file is to be used.
Run gothresher.
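
The quick setup steps can also be scripted. The sketch below only assembles and prints the two commands, using flags documented in this README; the helper names `build_commands` and `run_all` are made up for this example, and nothing is executed unless `dry_run` is set to False.

```python
import shlex
import subprocess

def build_commands(obo_path, ini_path, gaf_paths, extra_args=()):
    """Return the prep command and the gothresher command as argument lists."""
    prep = ["gothresher_prep", "-c", ini_path, "-i", obo_path]
    run = ["gothresher", "-i", *gaf_paths, *extra_args]
    return [prep, run]

def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

commands = build_commands(
    "ExampleData/go.obo",
    "gothresher.ini",
    ["ExampleData/goa_exampleYeast.gaf"],
    extra_args=["-a", "C", "-recal", "1"],
)
run_all(commands)  # dry run: prints the commands without running them
```

Passing argument lists to subprocess.run, rather than a single shell string, avoids quoting problems with file paths that contain spaces.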

GOThresher usage

A glossary and a more detailed explanation of some of the command line options are available in the Manual.

usage: gothresher [-h] [--prefix PREFIX] [--cutoff_prot CUTOFF_PROT]
                 [--cutoff_attn CUTOFF_ATTN] [--output OUTPUT]
                 [--evidence EVIDENCE [EVIDENCE ...] | --evidence_inverse
                 EVIDENCE_INVERSE [EVIDENCE_INVERSE ...]] --input INPUT
                 [INPUT ...] [--aspect ASPECT [ASPECT ...]]
                 [--assigned_by ASSIGNED_BY [ASSIGNED_BY ...] |
                 --assigned_by_inverse ASSIGNED_BY_INVERSE
                 [ASSIGNED_BY_INVERSE ...]] [--recalculate RECALCULATE]
                 [--info_threshold_Wyatt_Clark_percentile INFO_THRESHOLD_WYATT_CLARK_PERCENTILE | --info_threshold_Wyatt_Clark INFO_THRESHOLD_WYATT_CLARK]
                 [--info_threshold_Phillip_Lord_percentile INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE | --info_threshold_Phillip_Lord INFO_THRESHOLD_PHILLIP_LORD]
                 [--verbose VERBOSE] [--date_before DATE_BEFORE]
                 [--date_after DATE_AFTER] [--single_file SINGLE_FILE]
                 [--select_references SELECT_REFERENCES [SELECT_REFERENCES ...]
                 | --select_references_inverse SELECT_REFERENCES_INVERSE
                 [SELECT_REFERENCES_INVERSE ...]] [--report REPORT]

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX, -pref PREFIX
                        Add a prefix to the name of your output files.
  --cutoff_prot CUTOFF_PROT, -cprot CUTOFF_PROT
                        The threshold level for deciding to eliminate
                        annotations which come from references that annotate
                        more than the given 'threshold' number of PROTEINS
  --cutoff_attn CUTOFF_ATTN, -cattn CUTOFF_ATTN
                        The threshold level for deciding to eliminate
                        annotations which come from references that annotate
                        more than the given 'threshold' number of ANNOTATIONS
  --output OUTPUT, -odir OUTPUT
                        Writes the final outputs to the directory in this
                        path.
  --evidence EVIDENCE [EVIDENCE ...], -e EVIDENCE [EVIDENCE ...]
                        Accepts Standard Evidence Codes outlined in
                        http://geneontology.org/page/guide-go-evidence-codes.
                        All 3 letter code for each standard evidence is
                        acceptable. In addition to that, EXPEC is accepted
                        which will pull out all annotations which are made
                        experimentally. COMPEC will extract all annotations
                        which have been done computationally. Similarly,
                        AUTHEC and CUREC are also accepted. Cannot be provided
                        if -einv is provided
  --evidence_inverse EVIDENCE_INVERSE [EVIDENCE_INVERSE ...], -einv EVIDENCE_INVERSE [EVIDENCE_INVERSE ...]
                        Leaves out the provided Evidence Codes. Cannot be
                        provided if -e is provided
  --aspect ASPECT [ASPECT ...], -a ASPECT [ASPECT ...]
                        Enter P, C or F for Biological Process, Cellular
                        Component or Molecular Function respectively
  --assigned_by ASSIGNED_BY [ASSIGNED_BY ...], -assgn ASSIGNED_BY [ASSIGNED_BY ...]
                        Choose only those annotations which have been
                        annotated by the provided list of databases. Cannot be
                        provided if -assgninv is provided
  --assigned_by_inverse ASSIGNED_BY_INVERSE [ASSIGNED_BY_INVERSE ...], -assgninv ASSIGNED_BY_INVERSE [ASSIGNED_BY_INVERSE ...]
                        Choose only those annotations which have NOT been
                        annotated by the provided list of databases. Cannot be
                        provided if -assgn is provided
  --recalculate RECALCULATE, -recal RECALCULATE
                        Set this to 1 if you wish to enforce the recalculation
                        of the Information Accretion for every GO term.
                        Calculation of the information accretion is time
                        consuming. Therefore keep it to zero if you are
                        performing rerun on old data. The program will then
                        read the information accretion values from a file
                        which it wrote to in the previous run of the program
  --info_threshold_Wyatt_Clark_percentile INFO_THRESHOLD_WYATT_CLARK_PERCENTILE, -WCTHRESHp INFO_THRESHOLD_WYATT_CLARK_PERCENTILE
                        Provide the percentile p. All annotations having
                        information content below p will be discarded
  --info_threshold_Wyatt_Clark INFO_THRESHOLD_WYATT_CLARK, -WCTHRESH INFO_THRESHOLD_WYATT_CLARK
                        Provide a threshold value t. All annotations having
                        information content below t will be discarded
  --info_threshold_Phillip_Lord_percentile INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE, -PLTHRESHp INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE
                        Provide the percentile p. All annotations having
                        information content below p will be discarded. So if 5 is provided, proteins annotated by 
                        terms whose score is in the bottom 5% will be discarded.
  --info_threshold_Phillip_Lord INFO_THRESHOLD_PHILLIP_LORD, -PLTHRESH INFO_THRESHOLD_PHILLIP_LORD
                        Provide a  value t. All annotations having
                        information content below t will be discarded
  --verbose VERBOSE, -v VERBOSE
                        Set this argument to 1 if you wish to view the outcome
                        of each operation on the console
  --date_before DATE_BEFORE, -dbfr DATE_BEFORE
                        The date entered here will be parsed by the parser
                        from dateutil package. For more information on
                        acceptable date formats please visit
                        https://github.com/dateutil/dateutil/. All annotations
                        made prior to this date will be selected
  --date_after DATE_AFTER, -daftr DATE_AFTER
                        The date entered here will be parsed by the parser
                        from dateutil package. For more information on
                        acceptable date formats please visit
                        https://github.com/dateutil/dateutil/. All annotations
                        made after this date will be selected
  --single_file SINGLE_FILE, -single SINGLE_FILE
                        Set to 1 in order to output the results of all provided inputs into a single output file.
  --select_references SELECT_REFERENCES [SELECT_REFERENCES ...], -selref SELECT_REFERENCES [SELECT_REFERENCES ...]
                        Provide the paths to files which contain references
                        you wish to select. It is possible to include
                        references in case you wish to select annotations made
                        by a few references. This will prompt the program to
                        interpret string which have the keywords
                        'GO_REF','PMID' and 'Reactome' as a GO reference.
                        Strings which do not contain that keyword will be
                        interpreted as a file path which the program will
                        expect to contain a list of GO references. The program
                        will accept a mixture of GO_REF and file names. It is
                        also possible to choose all references of a particular
                        category and a handful of references from another. For
                        example if you wish to choose all PMID references,
                        just put PMID. The program will then select all PMID
                        references. Currently the program can accept PMID,
                        GO_REF and Reactome
  --select_references_inverse SELECT_REFERENCES_INVERSE [SELECT_REFERENCES_INVERSE ...], -selrefinv SELECT_REFERENCES_INVERSE [SELECT_REFERENCES_INVERSE ...]
                        Works like -selref but does not select the references
                        which have been provided as input
  --report REPORT, -r REPORT
                        Provide the path where the report file will be stored.
                        If you are providing a path please make sure your path
                        ends with a '/'. Otherwise the program will assume the
                        last string after the final '/' as the name of the
                        report file. A single report file will be generated.
                        Information for each species will be put into
                        individual worksheets.
  
Required arguments:
  --input INPUT [INPUT ...], -i INPUT [INPUT ...]
                        The input file path. Please remember the name of the
                        file must start with goa in front of it, with the name
                        of the species following separated by an underscore

NOTE: Files inside the folder temp are generated when -recal is set to 1.
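
The --input help text requires file names of the form goa_<species>. The sketch below shows one way to check a path against that convention before running gothresher. The regular expression is an assumption based on the example files (goa_exampleYeast.gaf, goa_exampleDicty.gaf) and is not taken from the GOThresher source; `species_from_input` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed convention: "goa_" followed by a species label, then an optional
# extension such as ".gaf". How GOThresher itself parses the name may differ.
_NAME_PATTERN = re.compile(r"^goa_(?P<species>[^_.]+)")

def species_from_input(path):
    """Return the species label from a goa_<species> file name, or None."""
    match = _NAME_PATTERN.match(Path(path).name)
    return match.group("species") if match else None

for candidate in ("ExampleData/goa_exampleYeast.gaf", "yeast_annotations.gaf"):
    species = species_from_input(candidate)
    status = species if species else "does not follow the goa_<species> convention"
    print(candidate, "->", status)
```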

Example usage

Step 1: Generating graphs and mapping files

$ gothresher_prep -i ExampleData/go.obo -c gothresher.ini

This command generates seven files in total: three files corresponding to the three ontologies, three files corresponding to the mapping between each GO term and its ancestors in its own ontology, and one file containing the mapping from alternate GO_ID to actual GO_ID. Run this command again every time you update the .obo file.

Step 2: Running GOThresher

$ gothresher -cprot 100 -i ExampleData/goa_exampleYeast.gaf ExampleData/goa_exampleDicty.gaf -a C -WCTHRESHp 2 -recal 1

This command reads two input files, one for yeast and one for dicty. The -a C argument selects only "CCO" annotations. The -WCTHRESHp 2 argument sets the Wyatt Clark threshold to the 2nd percentile, so all annotations whose Wyatt Clark information content falls in the bottom 2% are removed. Instead of a percentile, you can provide an absolute threshold value using the -WCTHRESH argument. In addition, -cprot 100 removes annotations that come from references annotating more than 100 proteins. The output is saved in the current directory. The -recal 1 argument is necessary in this command because the GO term to IC mapping has not been created yet. Subsequent runs on the same data with different threshold values do not need -recal; for new data files, however, use -recal 1. This command generates three output files: one for each of the two organisms, and a third in which the annotations for both organisms are combined.
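
To make the effect of -cprot concrete, the sketch below counts the distinct proteins annotated by each reference and drops every annotation whose reference exceeds the cutoff. It is an illustration of the idea under simplifying assumptions, not GOThresher's implementation: annotations are plain (protein, GO term, reference) tuples, and the protein names, term labels and references in the example are invented.

```python
from collections import defaultdict

def filter_by_protein_cutoff(annotations, cutoff_prot):
    """Keep annotations whose reference annotates at most cutoff_prot proteins.

    annotations: iterable of (protein_id, go_term, reference) tuples.
    """
    annotations = list(annotations)
    proteins_per_reference = defaultdict(set)
    for protein, _term, reference in annotations:
        proteins_per_reference[reference].add(protein)
    return [
        annotation
        for annotation in annotations
        if len(proteins_per_reference[annotation[2]]) <= cutoff_prot
    ]

# Invented example data: REF_A annotates three proteins, REF_B only one.
example = [
    ("P1", "term_a", "REF_A"),
    ("P2", "term_a", "REF_A"),
    ("P3", "term_b", "REF_A"),
    ("P1", "term_c", "REF_B"),
]
kept = filter_by_protein_cutoff(example, cutoff_prot=2)
print(kept)  # only the REF_B annotation remains
```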

$ gothresher -i ExampleData/goa_exampleYeast.gaf ExampleData/goa_exampleDicty.gaf -a C P -PLTHRESHp 30 -e EXPEC IBA -odir ExampleData/output -single 1

This command reads two input files and selects "CCO" and "BPO" annotations. It then keeps only those annotations that were made experimentally ("EXPEC") or that carry the computational evidence code "IBA" (Inferred from Biological aspect of Ancestor). In addition, it discards all annotations whose Phillip Lord information content falls in the bottom 30%. Instead of a percentile, you can provide an absolute threshold value using the -PLTHRESH argument. The final output is written to the ExampleData/output directory. The output path does not need to exist: the program attempts to create the folders if it has the required permissions. Because -single 1 is provided, only one output file is produced, containing all selected annotations from both organisms.
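
The percentile thresholds (-WCTHRESHp, -PLTHRESHp) can be pictured as follows. This is a minimal sketch under stated assumptions: it uses a nearest-rank percentile over per-term information content values and keeps annotations at or above the resulting cutoff. How GOThresher actually computes the percentile and the information content is not specified here, and the term labels and IC values below are invented for the example.

```python
import math

def percentile_cutoff(values, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def drop_low_information(annotations, ic_by_term, p):
    """Discard (protein, term) pairs whose term IC is below the p-th percentile."""
    cutoff = percentile_cutoff(list(ic_by_term.values()), p)
    return [(protein, term) for protein, term in annotations
            if ic_by_term[term] >= cutoff]

# Invented information content values for four placeholder terms.
ic_by_term = {"term_a": 0.5, "term_b": 2.0, "term_c": 4.0, "term_d": 8.0}
annotations = [("P1", "term_a"), ("P1", "term_c"), ("P2", "term_b")]

cutoff = percentile_cutoff(list(ic_by_term.values()), 30)
filtered = drop_low_information(annotations, ic_by_term, 30)
print(cutoff)
print(filtered)
```

An absolute threshold (-WCTHRESH, -PLTHRESH) corresponds to skipping the percentile step and comparing against a fixed cutoff value directly.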

$ gothresher -cattn 1000 -i ExampleData/goa_exampleYeast.gaf ExampleData/goa_exampleDicty.gaf -a C P -einv COMPEC -pref testing -selrefinv Reactome

This command reads two input files and selects "CCO" and "BPO" annotations. It discards annotations that were made computationally (-einv COMPEC) and filters out all annotations whose reference is "Reactome" (-selrefinv Reactome). The -cattn 1000 argument removes annotations that come from references making more than 1000 annotations. All output files are prefixed with the string "testing"; the program already creates a meaningful name for each file, and the prefix is added in front of it.

Unit testing

Unit tests are provided inside the tests directory. Note that if GOThresher was installed directly from PyPI using pip, or using conda, you will have to download the files needed to run the test script separately.

Prerequisites

Required module to run the test script:

unittest

unittest is part of the Python standard library and therefore comes packaged with Python.

Required files:

Download the entire GOThresher repository (Recommended):

$ git clone https://github.com/FriedbergLab/GOThresher
$ cd GOThresher/tests

Download the tests directory only:

$ svn export https://github.com/FriedbergLab/GOThresher.git/trunk/tests
$ cd tests

Run the tests from within the tests directory:

$ python test_gothresher.py

Expected output:

OK
