Multimodal Universe: Enabling Large-Scale Machine Learning with 70TBs of Astronomical Scientific Data
The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves, which aims to enable research into foundation models for astrophysics and beyond.
All datasets can be previewed directly from our Hugging Face hub and accessed via load_dataset('MultimodalUniverse/dataset_name').
Preview datasets include ~1k examples from each survey.
from datasets import load_dataset
dset = load_dataset('MultimodalUniverse/plasticc', split='train', streaming=True)
example = next(iter(dset))
You can try this out with our getting started notebook!
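Since each survey exposes its own fields, it can help to look at what a few streamed examples contain before writing any processing code. The following is a minimal sketch, not part of the project's tooling: `summarize_fields` is a hypothetical helper, and the stand-in records (including their field names) are made up so that the snippet runs offline.

```python
from itertools import islice


def summarize_fields(examples, n=3):
    """Map each field name to the set of Python type names seen in the first n examples.

    `examples` can be any iterable of dict-like records, such as a streamed dataset.
    """
    summary = {}
    for example in islice(examples, n):
        for key, value in example.items():
            summary.setdefault(key, set()).add(type(value).__name__)
    return summary


# Stand-in records so the sketch runs without network access.
# These field names are illustrative assumptions, not the real schema.
fake_stream = iter([
    {"object_id": "a1", "redshift": 0.12},
    {"object_id": "b2", "redshift": None},
])
print(summarize_fields(fake_stream))
```

With a real preview dataset, the same helper should work on the streamed `dset` object directly, for example `summarize_fields(dset, n=5)`, because a streaming dataset yields one dictionary per example.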
To access the full dataset, we recommend downloading the data locally. This is necessary for using the provided cross-matching utilities.
The full dataset content is hosted at the Flatiron Institute and available either through HTTPS or through GLOBUS:
- https://users.flatironinstitute.org/~flanusse/MultimodalUniverse
- https://app.globus.org/file-manager?origin_id=58a4d334-d750-454d-88a3-9d8256d091a6
GLOBUS is strongly preferred when downloading large amounts of data or a large number of files. Note that the cross-matching utilities require a local copy of the full data in its native HDF5 format.
After downloading the data, you can use Hugging Face's datasets library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:
from datasets import load_dataset
dset = load_dataset('path/to/downloaded/plasticc', split='train', streaming=True)
dset = dset.with_format('numpy')
example = next(iter(dset))
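A streamed dataset yields one example at a time, so a common next step is to group examples into small batches. The sketch below shows one library-independent way to do that with the standard library only. `iter_batches` is a hypothetical helper, not part of this project, and the datasets library has its own batching facilities that may be preferable in practice.

```python
from itertools import islice


def iter_batches(examples, batch_size):
    """Yield lists of up to `batch_size` consecutive examples from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the underlying iterable is exhausted
        yield batch


# Offline stand-in for a streamed dataset: five dummy "examples".
for batch in iter_batches(range(5), batch_size=2):
    print(batch)
```

Because the helper only consumes what it yields, it never loads more than one batch into memory, which matches the streaming use case above. The last batch may be shorter than `batch_size`.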
The Multimodal Universe currently contains data from the following surveys/modalities:
Survey | Modality | Science Use Case | # samples |
---|---|---|---|
Legacy Surveys DR10 | Images | Galaxies | 124M |
Legacy Surveys North | Images | Galaxies | 15M |
HSC | Images | Galaxies | 477k |
BTS | Images | Supernovae | 400k |
JWST | Images | Galaxies | 300k |
Gaia BP/RP | Spectra | Stars | 220M |
SDSS-II | Spectra | Galaxies, Stars | 4M |
DESI | Spectra | Galaxies | 1M |
APOGEE SDSS-III | Spectra | Stars | 716k |
GALAH | Spectra | Stars | 325k |
Chandra | Spectra | Galaxies, Stars | 129k |
VIPERS | Spectra | Galaxies | 91k |
MaNGA SDSS-IV | Hyperspectral Image | Galaxies | 12k |
PLAsTiCC | Time Series | Time-varying objects | 3.5M |
TESS | Time Series | Exoplanets | 160k |
CfA Sample | Time Series | Supernovae | 1k |
YSE | Time Series | Supernovae | 2k |
PS1 SNe Ia | Time Series | Supernovae | 369 |
DES Y3 SNe Ia | Time Series | Supernovae | 248 |
SNLS | Time Series | Supernovae | 239 |
Foundation | Time Series | Supernovae | 180 |
CSP SNe Ia | Time Series | Supernovae | 134 |
Swift SNe Ia | Time Series | Supernovae | 117 |
Gaia | Tabular | Stars | 220M |
PROVABGS | Tabular | Galaxies | 221k |
Galaxy10 DECaLS | Tabular | Galaxies | 15k |
We are accepting new datasets! Check out our contribution guidelines for more details.
We openly distribute the Multimodal Universe dataset under the Creative Commons Attribution (CC BY) 4.0 license. Note, however, that when using a specific subset, the license and conditions of use of that subset should be respected.
Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets sharing a common data schema for each modality and associated metadata. End users can then combine any subsets using the provided cross-matching utilities to build multimodal datasets.
Please see the Design Document for more context about the project.
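To make the idea of cross-matching concrete, here is a minimal, illustrative sketch of positional matching between two catalogs. It is not the project's cross-matching utility: the function names, the `ra`/`dec` field names (sky coordinates in degrees), and the 1 arcsecond default radius are all assumptions made for this example, and the brute-force search is only suitable for small catalogs.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds between two positions given in degrees.

    Uses the haversine formula, which stays accurate for small separations.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def cross_match(left, right, radius_arcsec=1.0):
    """Pair each object in `left` with its nearest neighbour in `right` within the radius.

    Returns a list of (left_index, right_index, separation_arcsec) tuples.
    Brute force, O(len(left) * len(right)); hypothetical helper for illustration.
    """
    matches = []
    for i, a in enumerate(left):
        best_j, best_sep = None, radius_arcsec
        for j, b in enumerate(right):
            sep = angular_separation_arcsec(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j, best_sep))
    return matches


# Made-up coordinates: only the first image has a spectrum within 1 arcsecond.
images = [{"ra": 150.0, "dec": 2.0}, {"ra": 10.0, "dec": -5.0}]
spectra = [{"ra": 150.0001, "dec": 2.0}, {"ra": 200.0, "dec": 30.0}]
print(cross_match(images, spectra))
```

Matched index pairs like these are what allow an image from one survey to be joined with a spectrum or light curve of the same object from another. For catalogs with millions of entries, a spatial index would be needed instead of the nested loop shown here.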