Free collection of Bio datasets and embeddings

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Bio-datasets

Open-source collection of biology datasets and pre-trained embeddings. 🧬 📕

Description

bio-datasets is a collaborative framework for fetching publicly available, sequence-based protein datasets. Pre-trained contextual embeddings are also available for these datasets.

Installation

Install the package and its required dependencies with pip install bio-datasets.
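
To confirm that the installation worked, a quick check along the following lines may help. This is a minimal sketch using only the Python standard library; the helper name is_installed is hypothetical and not part of bio-datasets. The import name biodatasets is taken from the usage example in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path.

    Hypothetical helper for illustration; not part of bio-datasets.
    """
    return importlib.util.find_spec(module_name) is not None


# "biodatasets" is the import name used in this README's usage example.
if is_installed("biodatasets"):
    print("biodatasets is available")
else:
    print("biodatasets not found; try: pip install bio-datasets")
```

The check only looks the module up and does not import it, so it should be safe to run before any dataset is downloaded.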

How it works

from biodatasets import list_datasets, load_dataset

# List the available datasets
print(list_datasets())

# Load your dataset
pathogen = load_dataset("pathogen")

# Display the available columns and embeddings
print(pathogen)

# Get data from your dataset
X, y = pathogen.to_npy_arrays(input_names=["sequence"], target_names=["class"])
embeddings = pathogen.get_embeddings("sequence", "protbert", "cls")

# Get a full description of your dataset
pathogen.display_description()
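
When the dataset name comes from user input or a config file, it can be useful to check it against the available datasets before loading. The sketch below is one possible way to do that. It assumes that list_datasets() returns an iterable of dataset names as strings, which this README does not state explicitly. The helper load_if_available is hypothetical and takes the two library functions as arguments so that it can be tried without the package installed.

```python
from typing import Any, Callable, Iterable


def load_if_available(
    name: str,
    list_fn: Callable[[], Iterable[str]],
    load_fn: Callable[[str], Any],
) -> Any:
    """Load a dataset only if its name is among the listed datasets.

    Hypothetical helper, not part of bio-datasets. Assumes list_fn()
    yields dataset names as strings (an assumption about list_datasets).
    """
    available = sorted(list_fn())
    if name not in available:
        raise ValueError(
            f"Unknown dataset {name!r}; available datasets: {available}"
        )
    return load_fn(name)


# Intended usage with the library installed:
#   from biodatasets import list_datasets, load_dataset
#   pathogen = load_if_available("pathogen", list_datasets, load_dataset)
```

Passing the functions in explicitly keeps the helper independent of the library, so the error message can list the valid names instead of failing inside the loader.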

How to contribute

See CONTRIBUTING.md for how to set up the project or add a public dataset.