Multimodal Universe: Enabling Large-Scale Machine Learning with 70TBs of Astronomical Scientific Data

Overview

The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves. It aims to enable research into foundation models for astrophysics and beyond.

Quick Start

All datasets can be previewed directly from our HuggingFace hub and accessed via load_dataset('MultimodalUniverse/dataset_name')! Preview datasets include ~1k examples from each survey.

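from datasets import load_dataset

# Stream the preview dataset from the Hub instead of downloading it up front
dset = load_dataset('MultimodalUniverse/plasticc',
                    split='train', streaming=True)

# Pull the first example from the stream
example = next(iter(dset))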

You can try this out with our getting started notebook!
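Once you have an `example` in hand, it helps to check which fields it contains before writing any processing code. The helper below is a minimal sketch of that step, not part of the project: `describe_example` is a hypothetical name, and the field names in `fake_example` are made up to stand in for a real streamed example so that the snippet runs without network access.

```python
from typing import Any, Dict, Mapping


def describe_example(example: Mapping[str, Any]) -> Dict[str, str]:
    """Return a short type/shape summary for each field of one example."""
    summary = {}
    for name, value in example.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            # NumPy-like arrays: report shape and dtype
            dtype = getattr(value, "dtype", "unknown")
            summary[name] = f"array{tuple(shape)} dtype={dtype}"
        elif isinstance(value, Mapping):
            # Nested records: list the sub-field names
            keys = ", ".join(sorted(str(k) for k in value.keys()))
            summary[name] = f"dict with keys {keys}"
        elif isinstance(value, (list, tuple)):
            summary[name] = f"{type(value).__name__} of length {len(value)}"
        else:
            summary[name] = type(value).__name__
    return summary


# Hypothetical stand-in for `example = next(iter(dset))`;
# these field names are illustrative only.
fake_example = {
    "object_id": "42",
    "lightcurve": {"time": [0.0, 1.0], "flux": [1.2, 1.3]},
    "redshift": 0.1,
}

for name, description in describe_example(fake_example).items():
    print(f"{name}: {description}")
```

With a real dataset, pass the streamed `example` instead of `fake_example`; the actual fields depend on the survey.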

Data Access

To access the full dataset, we recommend downloading the data locally.

The full dataset content is hosted at the Flatiron Institute and is available through either HTTPS or GLOBUS.

GLOBUS is strongly preferred when downloading large amounts of data or a large number of files. A local download of the full data in its native HDF5 format is necessary for using the provided cross-matching utilities.

After downloading the data, you can use Hugging Face's datasets library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:

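from datasets import load_dataset

# Point load_dataset at the local copy of the dataset
dset = load_dataset('path/to/downloaded/plasticc',
                    split='train', streaming=True)
# Return fields as NumPy arrays rather than Python lists
dset = dset.with_format('numpy')

example = next(iter(dset))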

Datasets

The Multimodal Universe currently contains data from the following surveys/modalities:

Survey	Modality	Science Use Case	# samples
Legacy Surveys DR10	Images	Galaxies	124M
Legacy Surveys North	Images	Galaxies	15M
HSC	Images	Galaxies	477k
BTS	Images	Supernovae	400k
JWST	Images	Galaxies	300k
Gaia BP/RP	Spectra	Stars	220M
SDSS-II	Spectra	Galaxies, Stars	4M
DESI	Spectra	Galaxies	1M
APOGEE SDSS-III	Spectra	Stars	716k
GALAH	Spectra	Stars	325k
Chandra	Spectra	Galaxies, Stars	129k
VIPERS	Spectra	Galaxies	91k
MaNGA SDSS-IV	Hyperspectral Image	Galaxies	12k
PLAsTiCC	Time Series	Time-varying objects	3.5M
TESS	Time Series	Exoplanets	160k
CfA Sample	Time Series	Supernovae	1k
YSE	Time Series	Supernovae	2k
PS1 SNe Ia	Time Series	Supernovae	369
DES Y3 SNe Ia	Time Series	Supernovae	248
SNLS	Time Series	Supernovae	239
Foundation	Time Series	Supernovae	180
CSP SNe Ia	Time Series	Supernovae	134
Swift SNe Ia	Time Series	Supernovae	117
Gaia	Tabular	Stars	220M
PROVABGS	Tabular	Galaxies	221k
Galaxy10 DECaLS	Tabular	Galaxies	15k

We are accepting new datasets! Check out our contribution guidelines for more details.

Data License

We openly distribute the Multimodal Universe dataset under the Creative Commons Attribution (CC BY) 4.0 license. Note, however, that when using a specific subset, the license and conditions of use of that subset must be respected.

Architecture

Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets that share a common data schema for each modality and its associated metadata. End-users can then combine any subsets into multimodal datasets using the provided cross-matching utilities.
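To make the cross-matching step concrete, here is a minimal sketch of one common way to cross-match two catalogs: pairing sources by sky position within a matching radius. This is not the project's cross-matching utility. The function names, the `(ra, dec)` tuple layout, and the assumption that matching is purely positional are all illustrative, and the brute-force loop is only practical for small catalogs.

```python
import math
from typing import List, Sequence, Tuple

ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi


def angular_separation_arcsec(ra1: float, dec1: float,
                              ra2: float, dec2: float) -> float:
    """Great-circle separation (haversine) between two positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin(0.5 * (dec2 - dec1)) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(0.5 * (ra2 - ra1)) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain
    return 2.0 * math.asin(min(1.0, math.sqrt(h))) * ARCSEC_PER_RADIAN


def cross_match(left: Sequence[Tuple[float, float]],
                right: Sequence[Tuple[float, float]],
                radius_arcsec: float = 1.0) -> List[Tuple[int, int]]:
    """Pair each left source with its nearest right source within the radius.

    Returns (left_index, right_index) pairs. Brute force, O(len(left) * len(right)).
    """
    matches = []
    for i, (ra_l, dec_l) in enumerate(left):
        best_j, best_sep = None, radius_arcsec
        for j, (ra_r, dec_r) in enumerate(right):
            sep = angular_separation_arcsec(ra_l, dec_l, ra_r, dec_r)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j))
    return matches


# Made-up (ra, dec) positions in degrees, for illustration only
catalog_a = [(150.0, 2.0), (10.0, -5.0)]
catalog_b = [(10.0 + 0.5 / 3600, -5.0), (200.0, 30.0), (150.0, 2.0 + 2.0 / 3600)]

print(cross_match(catalog_a, catalog_b, radius_arcsec=1.0))
print(cross_match(catalog_a, catalog_b, radius_arcsec=3.0))
```

For catalogs with millions of sources, a spatial index (for example a k-d tree or HEALPix binning) would replace the inner loop.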

Please see the Design Document for more context about the project.

Contributors

Full Contribution List

Francois Lanusse 📆 💡 💻	Liam Parker 📆 💡 💻	Micah Bowles 📆 💡 💻	mhuertascompany 📆 💡 💻	Mike Smith 📆 💡 💻	Helen Qu 📆 💡 💻	Aaron 💡 💻
Ben Boyd 💡 💻	Brian Cherinka 💻	Connor Stone, PhD 💡	David Chemaly 💡 💻	Erin Hayes 💡 💻	Henry Leung 💻	Ioana Ciucă 🖋
Jeff Shen 💻	jeraud 💡 💻	John F. Wu 🖋	CambridgeAstroStat 🧑‍🏫	Kartheik Iyer 💻	Lucas Meyer 💻	Matthew Grayling 💡 💻
Maja Jabłońska 💻	Mike Walmsley 💡 💻	Miles Cranmer 🖋	Peter Melchior 💻	Rafael Martínez-Galarza 💻	Tom Hehir 💡 💻	Shirley Ho 🔍 🖋