Welcome to CRFE! This user manual provides a comprehensive guide to using this software effectively.
CRFE provides a high-level interpretation of high-throughput gene expression data. For this type of analysis, the user possesses a ranked list of genes, ranked f.e. by differential expression between treatment and control samples. The genes above a certain threshold are considered differentially expressed or perturbed, all other genes form the background or the unperturbed genes. Given categories that describe the biological functions of different sets of genes (e.g., biological process terms and annotations from the Gene Ontology), we can ask the question of which functional categories best explain the observed gene list. A good collection of functional categories explains as many as possible perturbed genes, and as few as possible unperturbed genes. In particular, particularly many highly perturbed genes should be explained and the collection should consist of only a small number of specific categories. CRFE returns a list of functional categories that satisfies all these requirements. Investigating this list is a good first step to interpreting the results of any high-throughput gene expression experiment. CRFE can also be used in other omics functional enrichment analyses, including genome, epigenome, proteome, and metabolome.
If you use CRFE in your work, please cite the CRFE paper: https://www.biorxiv.org/content/10.1101/2023.06.30.547164v1.
Feel free to link to CRFE in your Methods: https://github.com/ckadelka/CRFE
A working Python distribution and the program crfe.py
are required to use CRFE. The program has been developed and tested extensively under Python 3.10 but should work with lower Python 3.x versions as well. It requires the packages, sys
, math
, random
, os
, datetime
, time
, numpy
, scipy
, pandas
, pickle
, and optparse
. The correct setup can be tested by executing the example below and comparing the output to the provided output.
CRFE requires two input file: an annotation file (describing categories with gene annotations such as Gene Ontology Annotation File or KEGG pathway) and a list of genes to be enriched (containing gene names and their corresponding levels or ranking).
- Annotation file is a tab-separated file with two columns: the first column contains gene categories and the second column contains all gene annotations in that category, whitespace-separated. Below is an example of a valid annotation file:
reproduction | A1CF A2M AAAS ABAT ABCC8 ABHD2 ACE |
developmental growth | ABL1 ACVR1C ACVR2B ADAM15 ADARB1 ADM |
- Gene file is a tab-separated file with two columns: the first column contains gene names (using the same naming system as the gene names in the annotation file) and the second column contains numeric values upon which the list of genes will be ranked (in descending order by default). If the values of the genes are not provided, the gene order in the first column will be used.
We also provide a Python script to get the required annotation file using a Gene Ontology Annotation File (GAF) as the input. If users wish to parse a GAF to a valid annotation file for CRFE, an OBO file is needed and can be retrieved from GO Consortium.
To run the parse_DAG.py
script, use the following command:
python parse_DAG.py -i <path to obo file> -ont <ontology to use> -g <path to GAF> -o <output directory>
List of arguments used in parse_DAG.py
:
-h, --help show this help message and exit
--inputobo INPUTOBO, -i INPUTOBO
An .obo file of the Gene Ontology, preferably obtained
from the same release as the Gene Annotation file .gaf to avoid obsolete terms.
--ontology ONTOLOGY, -ont ONTOLOGY
Which Gene Ontology namespace to use, one of 'F','P', or 'C'
(corresponds to molecular_function, biological_process, or cellular_component, respectively),
default to 'P'
--gafinput GAFINPUT, -g GAFINPUT
A Gene Ontology annotation file (.gaf), preferably obtained from
the same release as the Gene Ontology structure .obo file to avoid obsolete annotations
--odir ODIR, -o ODIR Annotation file output directory
The script creates a directed acyclic graph using the provided obo
file in the chosen ontology (Molecular Function, Biological Process, or Cellular Component). Any gene annotated to a Gene Ontology (GO) term is also annotated to the ancestors of that term; therefore, the gene annotations in GAF are propagated using the constructed graph. In addition, GO terms with exactly similar gene annotations are combined. As a result, the script outputs two annotation files in the specified output directory which differ in the format of GO terms, one with GO term ID (e.g., GO:0000003), one with GO term name (e.g., reproduction). The gene annotations (the second column) are similar in these two files.
CRFE accepts different types of command line options. See the table below or type crfe.py -h
for details about all available command line options.
A typical invocation of CRFE will look like this:
python crfe.py -a <annotations file> -g <gene file> -i test
We recommend using CRFE with user-defined parameters based on the data. Specifically, users should set the following parameters:
-T THRESHOLD_TYPE
and-t THRESHOLD
: These parameters should be adjusted based on which genes are considered perturbed.-b BELIEF
: This parameter determines the emphasis CRFE places on explaining the top-ranked genes.-n REPEATS
: The MCMC optimization of CRFE is stochastic. This parameter specifies the number of repeated CRFE runs.
By customizing these parameters, users can optimize the performance of CRFE for their particular use case.
To test CRFE on your local machine, please download the annotation file GOhuman_ft_named.txt
and the sample data GSE87340_tumor_normal_log2fc-overexp.txt
from the data folder.
To run GSEA with the default settings, use the following command line:
python crfe.py -a GOhuman_propagate_combined.txt -g GSE87340_tumor_normal_log2fc-overexp.txt
Once executed, CRFE will create a saved_data directory where it will store information about the annotation file and the input gene list for faster reference. The result text file generated by CRFE will be saved in the output directory.
The result text file consists of two parts: The first part records all the parameters of the current run for reproducibility. The second part is a table that presents all enriched functional categories (rows, ranked by mean posterior probability), as well as several statistics for each category (columns).
Statistic | Description |
---|---|
Order | the rank of the functional category based on its mean posterior probability |
#(Annotations) | the number of elements (genes) annotated to the functional category |
#(Annotations of perturbed genes) | the number of perturbed elements (genes) annotated to the functional category |
Mean Posterior Probability | the average posterior probability across all repeats |
Standard Deviation Posterior Probability | The standard deviation of the posterior probability across all repeats |
p-value | the one-tailed p-value from the hypergeometric distribution (values < 10^{-16} are 0) |
Average Expression Level | the average expression level (if provided) of all annotated elements (genes) |
Average position perturbed genes | the average position of all annotated perturbed elements (genes), compared to all perturbed genes |
Term Name | identifier or name of the category |
Options:
-h, --help show this help message and exit
-a ANNOTATIONS_FILE, --annotations_file=ANNOTATIONS_FILE
Annotation data (csv or txt). Each row contains the
term name followed by a tab, followed by all annotated
genes separated by space. No header!
-g GENE_FILE, --gene_file=GENE_FILE
Expression data (csv or txt). Each row contains the
gene name, optionally followed by tab plus expression
level. No header!
-t THRESHOLD, --threshold=THRESHOLD
If threshold_type==proportion, this describes the
proportion of genes considered perturbed, otherwise it
describes the threshold in the expression values to be
used (default 0.3)
-T THRESHOLD_TYPE, --threshold_type=THRESHOLD_TYPE
Whether threshold should be interpreted as 1. a value,
2. proportion ('proportion') of perturbed genes, or 3.
different levels of genes already provided in the
expression data ('levels'), (default proportion)
-c LOWER_CUTOFF, --lower_cutoff=LOWER_CUTOFF
Only terms with at least this many annotations are
considered (default 20)
-C UPPER_CUTOFF, --upper_cutoff=UPPER_CUTOFF
Only terms with at most this many annotations are
considered. Use 0 to excluded an upper cutoff (default
500).
-b BELIEF, --belief=BELIEF
Belief for how much more active the highest gene set
is compared to the least active gene set (default 5)
-n REPEATS, --repeats=REPEATS
Number of independent repeats of CRFE (default 1).
-s BURNIN_STEPS, --burnin_steps=BURNIN_STEPS
Length of (unrecorded) burnin period of MCMC
simulation. Used to initialize the Markov chain
(default 50000)
-S MCMC_STEPS, --MCMC_steps=MCMC_STEPS
Length of (recorded) steps in the MCMC simulation
after the (unrecorded) burnin period (default 50000)
-x ALPHA_BETA_MAX, --alpha_beta_max=ALPHA_BETA_MAX
Maximal value for the false positive and false
negative rate that can be learned (default 0.5).
Should not be changed. CRFE already automatically
chooses the false poositive and negative rates such
that (i) an explained perturbed gene is never
penalized more than an unexplained perturbed gene, and
that (ii) an unexplained unperturbed gene is never
penalized more than an explained unperturbed gene.
-X NUMBER_DIFFERENT_FALSE_RATES, --number_different_false_rates=NUMBER_DIFFERENT_FALSE_RATES
Number of different possible values for the false
positive and false negative rate (default 20).
-Q NUMBER_DIFFERENT_PENALIZATIONS, --number_different_penalizations=NUMBER_DIFFERENT_PENALIZATIONS
Number of different possible values for the term size
penalization parameter q (default 20)
-p PROBABILITY_PARAMETER_CHANGE, --probability_parameter_change=PROBABILITY_PARAMETER_CHANGE
Proportion of time parameters are changed in MCMC
(default 0.2), if 0 no parameters are being learnt.
-r PARAMETER_INITIAL_MCMC_SET, --parameter_initial_MCMC_set=PARAMETER_INITIAL_MCMC_SET
When nonzero, the algorithm does not start with an
empty set. If r>=1, r random processes are chosen, if
0<r<1, each process is in the initial set with
probability r (default 0, i.e., empty set)
-o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
Local reference to folder where output is saved
(default: output/)
-i IDENTIFIER, --identifier=IDENTIFIER
Identifying string used to distinguish different
experiments (empty default)
-v If True, more information is printed during the CRFE
run (slows down CRFE).
-R SEED, --seed=SEED If nonnegative, this seed is used too initialize the
random number generators (default: -1, i.e. random
seed)
Please contact Xinglin Jia (xjia@iastate.edu), An Phan (ahphan@iastate.edu), or Claus Kadelka (ckadelka@iastate.edu) with any questions about this program.