/ema

Primary LanguageJupyter NotebookMIT LicenseMIT

ema-tool

ema-tool is a Python library designed to facilitate the initial comparison of diverse embedding spaces in biomedical data. By incorporating user-defined metadata on the natural grouping of data points, ema-tool enables users to compare global statistics and understand the differences in clustering of natural groupings across different embedding spaces.

More information about the ema-tool can be found in our pre-print on bioRxiv: ema-tool: a Python Library for the Comparative Analysis of Embeddings from Biomedical Foundation Models.

Overview

Features

Given a set of samples and metadata, and at least two embedding spaces, the ema-tool provides visualisations to compare the following aspects of the embedding spaces:

  • Unsupervised Clusters: ema-tool provides a simple interface to cluster samples in the embedding space using the KMeans algorithm and compare against user-defined metadata.
  • Dimensionality Reduction: ema-tool allows users to reduce the dimensionality of the embedding space using PCA, t-SNE, or UMAP.
  • Pairwise Distances: ema-tool computes pairwise distances between samples in the embedding space. Different distance metrics are available, including Euclidean, Cosine, and Mahalanobis.

The following figure provides an overview of the ema-tool workflow:

ema-tool

Installation

You can install the ema library through pip, or access examples locally by cloning the github repo.

Installing the ema library

pip install ema-emb

Cloning the ema repo

git clone https://github/pia-francesca/ema

cd ema                         # enter project directory
pip3 install .                 # install dependencies
jupyter lab colab_notebooks    # open notebook examples in jupyter for local exploration

Getting Started

To get started with the ema-tool, load the metadata and embeddings, and initialize the EmbeddingHandler object. The following code snippet demonstrates how to use the ema-tool to compare two embedding spaces:

# import ema object
from ema.ema import EmbeddingHandler

# load metadata and embeddings
metadata = pd.read_csv(FP_METADATA)
emb_esm1b = np.load(FP_EMB_ESM1b)
emb_esm2 = np.load(FP_EMB_ESM2)

# initialize embedding handler
emb_handler = EmbeddingHandler(metadata)

# add embeddings to the handler
emb_handler.add_emb_space(embeddings=emb_esm1b, emb_space_name='esm1b')
emb_handler.add_emb_space(embeddings=emb_esm2, emb_space_name='esm2')

# start analysis
emb_hander.plot_emb_hist()

Colab Notebook Example

Two examples of how to use the ema-tool library is provided in the following colab notebooks:

Link Description
Open In Colab  Example of how to use the ema-tool to compare protein embeddings across three ESM models
Open In Colab  Example of how to use the ema-tool to compare embeddings of missense mutations across two ESM models

Links to Embedding Scripts

To allow a flexible use, ema-tool does not include the scripts for generating the embeddings. However, here are some links to external scripts for generating protein embeddings from fasta files using the following models:

Contact

If you have any questions or suggestions, please feel free to reach out to the authors: francesca.risom@hpi.de.

License

This project is licensed under the MIT License - see the LICENSE file for details.