We present this repository as a template for conducting internal benchmarking exercises. In particular, we:
- showcase a reproducible way to compile a dataset from primary sources,
- provide an easy-to-understand summary data card for the dataset,
- define a supervised classification task and performance metrics,
- report results for each model (e.g., the MLP-NSF model and SCANVI) in a way that permits fair comparisons across methods.
This effort formulates a decentralized approach to continuous benchmarking:
- individual modelers are only responsible for tuning and submitting results for their own methods
- new methods can be added at any time, and results can still be compared fairly with previously submitted methods
- all reporting (data + model cards) is intended to be succinct yet accessible to a non-expert
We hope that this format can prevent misrepresentation of methods, provide archival value through a minimal but necessary level of reproducibility, and serve as a basis for informed decision-making when choosing methods for specific tasks.
We identified the 10Xv3 single-nucleus Mouse M1 dataset generated by the Allen Institute as a candidate benchmark dataset. The relevant links are below:
- Updated taxonomy: Links to dendrograms and hierarchy
- BDS Google Drive: Count data here is expected to match the version deposited to NeMO; metadata incorporates updates to the taxonomy (compared to the NeMO version)
- NeMO archive: Files for the benchmark are under `Analysis -> BICCN_MOp_snRNA_10X_v3_Analysis_AIBS`
- Cell type explorer link: Summary of the different Mouse M1 datasets in the lower-left panel.
```bash
# Create and activate the conda environment
conda create -n bmark
conda activate bmark
conda install python=3.8
conda install seaborn scikit-learn statsmodels numba pytables
conda install -c conda-forge python-igraph leidenalg
pip install scanpy
pip install gdown timebudget autopep8 toml
pip install jupyterlab
# Install this repository in editable mode
pip install -e .
```
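To confirm the environment resolved correctly, a quick import check can help (a minimal sketch; the imports mirror the install commands above):

```python
# Quick sanity check that key dependencies installed correctly
# (these imports mirror the conda/pip commands above).
import scanpy as sc
import sklearn
import igraph
import leidenalg

print("scanpy:", sc.__version__)
print("scikit-learn:", sklearn.__version__)
```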
```bash
# Download the pilot data (sourcing the script provides get_bmark_pilot)
source scripts/download_scripts.sh
get_bmark_pilot /allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot/

# Process the raw data with scripts in ./scripts
python -m make_pilot_h5ad --data_path /allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot --min_sample_thr 20 --write_h5ad 1
python -m make_pilot_markers --data_path /allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot --write_csv 1
```
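Once processing completes, the resulting AnnData file can be inspected with scanpy. A minimal sketch, assuming the output lands in the data directory (the filename `pilot.h5ad` is a hypothetical placeholder; check the script's actual output):

```python
import scanpy as sc

# Hypothetical filename: make_pilot_h5ad writes an .h5ad into --data_path,
# but the exact name below is an assumption.
adata = sc.read_h5ad(
    "/allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot/pilot.h5ad"
)
print(adata)             # dimensions and registered annotation fields
print(adata.obs.head())  # per-nucleus metadata, e.g. cell type labels
```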
Create a `config.toml` file at the repository root with the appropriate `data_dir` path:

```toml
['pilot']
data_dir = '/allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot/'
```
- `config.toml` is accessed through `load_config` in `bmark.utils.config`.
- Use `config.toml` to include any other hardcoded paths needed for notebooks/scripts to work correctly.
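For illustration, a notebook might obtain the data directory as follows (a minimal sketch, assuming `load_config` returns the parsed contents of `config.toml` as a dict-like object):

```python
# Minimal sketch, assuming load_config parses config.toml at the
# repository root and returns its contents as a dict-like object.
from bmark.utils.config import load_config

config = load_config()
data_dir = config["pilot"]["data_dir"]
print(data_dir)
```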
Rohan Gala, Nelson Johansen, Raymond Sanchez, Kyle Travaglini