Unsupervised neural network for learning embeddings of GO terms.

Anc2vec

Anc2vec is a novel neural-network method for constructing embeddings of Gene Ontology (GO) terms using only three structural features of the ontology: the ontological uniqueness of terms, their ancestor relationships, and the sub-ontology to which they belong.

This repository provides a Python package containing the source code of anc2vec, as well as instructions for reproducing the main results of the study in which this method was proposed:

Anc2vec: embedding Gene Ontology terms by preserving ancestors relationships, by A. A. Edera, D. H. Milone, and G. Stegmayer. Research Institute for Signals, Systems and Computational Intelligence, sinc(i).

Fig. 1. Panel A) The GO structure is composed of hierarchical relationships between terms arranged in three sub-ontologies: BP, CC, and MF. Panel B) Anc2vec architecture. A GO term is encoded as a one-hot vector x, which is transformed into an embedding h. The embedding is then used to predict three structural features of the GO, and the prediction errors drive weight optimization.

Fig. 2. Anc2vec embeddings of GO terms in the three sub-ontologies. Each point depicts the embedding of a GO term, and its color encodes the sub-ontology: BP (Biological Process), CC (Cellular Component), or MF (Molecular Function). A video is available showing how 2-dimensional embeddings are adjusted during weight optimization.

Requirements

Anc2vec requires Python 3.6 and TensorFlow 2.3.1.

Installation

We recommend installing Conda to avoid Python package conflicts.

If Conda is installed, first create and activate a conda environment, for example, named anc2vec:

conda create --name anc2vec python=3.6
conda activate anc2vec

Next, install the anc2vec package via the pip package manager:

pip install -U "anc2vec @ git+https://github.com/aedera/anc2vec.git"

Anc2vec functionalities

Access pre-trained embeddings

The anc2vec package ships with the same GO term embeddings used in the study. These embeddings were built from the Gene Ontology release 2020-10-06 and can be loaded in Python as follows:

import anc2vec

es = anc2vec.get_embeddings()

Here, es is a Python dictionary that maps GO terms to their corresponding 200-dimensional embeddings. For example, the following command retrieves the embedding of the term GO:0001780:

e = es['GO:0001780']

The variable e is a NumPy array containing the embedding:

array([ 0.55203265, -0.23133564,  0.1983797 , -0.3251996 ,  0.20564775,
       -0.32133245, -0.25364587, -0.16675541, -0.46832997, -0.40702957,
       ...
       -0.29757708, -0.33143485, -0.31099185,  0.24465033, -0.25458524,
       -0.24525951, -0.366758  , -0.04628978,  0.29378492,  0.31249675],
      dtype=float32)

These anc2vec embeddings are ready to be used for semantic similarity tasks. The examples below show how to use them to calculate cosine distances.
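As a minimal sketch of the cosine distance mentioned above, the snippet below computes 1 minus the cosine similarity of two vectors. The helper name cosine_distance and the toy dictionary are illustrative assumptions and are not part of the anc2vec package. The function uses only the standard library and accepts any equal-length numeric sequences, so NumPy arrays such as those stored in es should work as well.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length numeric sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-dimensional stand-ins for real 200-dimensional embeddings.
# Keys and values are made up for illustration only.
toy = {
    "term_a": [1.0, 0.0, 0.0],
    "term_b": [2.0, 0.0, 0.0],  # same direction as term_a
    "term_c": [0.0, 3.0, 0.0],  # orthogonal to term_a
}

d_same = cosine_distance(toy["term_a"], toy["term_b"])  # parallel vectors
d_orth = cosine_distance(toy["term_a"], toy["term_c"])  # orthogonal vectors
```

With the pre-trained dictionary, the same call would take two entries of es, for example es['GO:0001780'] and the embedding of any other term present in the dictionary. Smaller distances indicate embeddings that point in more similar directions.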

Build your own embeddings

The anc2vec package also contains a function to build embeddings from scratch from a given OBO file, a human-readable format commonly used to describe the GO. Building embeddings can be particularly useful in experimental scenarios that require a specific version of the GO, such as those available in the GO data archive.

The following code shows how to build embeddings from an OBO file named go.obo:

import anc2vec
import anc2vec.train as builder

es = builder.fit('go.obo', embedding_sz=200, batch_sz=64, num_epochs=100)

The builder module reads the input go.obo file and extracts the structural features used to build the embeddings of GO terms. Note that builder.fit is called with additional parameters, including the dimensionality of the embeddings (embedding_sz) and the number of optimization steps used to build them (num_epochs). The resulting embeddings are stored in es, a Python dictionary mapping GO terms to their corresponding embeddings.
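Because building embeddings can take a while, it may be convenient to store the resulting dictionary on disk and reload it later. The following is a minimal sketch of one way to do this with JSON. The helpers save_embeddings and load_embeddings are hypothetical names that are not provided by anc2vec, and the toy dictionary stands in for the output of builder.fit.

```python
import json
import os
import tempfile


def save_embeddings(embeddings, path):
    """Write a {term: vector} dictionary to a JSON file."""
    # Convert every vector to a plain list of Python floats so that
    # NumPy arrays and lists are serialized the same way.
    serializable = {
        term: [float(x) for x in vec] for term, vec in embeddings.items()
    }
    with open(path, "w") as fh:
        json.dump(serializable, fh)


def load_embeddings(path):
    """Read back a {term: list of floats} dictionary."""
    with open(path) as fh:
        return json.load(fh)


# Toy stand-in for the dictionary returned by builder.fit (made-up values).
toy_es = {"term_a": [0.5, -0.25], "term_b": [0.0, 1.0]}

path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
save_embeddings(toy_es, path)
restored = load_embeddings(path)
```

The reloaded vectors are plain Python lists. If array operations are needed, each one can be converted back with numpy.asarray.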

Please check the examples below for more information about this functionality.

Notebooks: examples on how to use the anc2vec package

To try anc2vec, the links below point to Jupyter notebooks that run on Google Colab, which offers free computing on the Google cloud.

Datasets

These are the main datasets used in the experiments of the study where anc2vec is proposed:

License

The anc2vec package is released under the MIT License.