bmark

Pilot for single cell benchmarking datasets


Summary

We showcase this repository as a way to conduct internal benchmarking exercises.

In particular we:

  1. showcase a reproducible way to compile a dataset from primary sources,
  2. provide an easy-to-understand summary data card for the dataset,
  3. define a supervised classification task and performance metrics,
  4. report results for each model (e.g. the MLP-NSF model and SCANVI) in a way that permits fair comparison across methods.
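Step 4 implies a shared scoring procedure applied identically to every model's predictions. A minimal sketch of how that might look with scikit-learn (the label names and arrays below are purely illustrative, not drawn from the pilot dataset):

```python
# Hypothetical scoring sketch: the same metric code is applied to each
# model's submitted predictions, so numbers are comparable across methods.
from sklearn.metrics import accuracy_score, f1_score

# Illustrative cell-type labels (not from the pilot dataset).
y_true = ["Sst", "Pvalb", "Sst", "Vip", "Pvalb", "Vip"]
y_pred = ["Sst", "Pvalb", "Vip", "Vip", "Pvalb", "Sst"]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f}")
```

Reporting a class-balanced metric such as macro-averaged F1 alongside accuracy guards against a model that performs well only on abundant cell types.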

This effort attempts to formulate a decentralized approach to continuous benchmarking.

  • individual modelers are only responsible for tuning and submitting results for their own methods
  • new methods can be added at any time, and their results can be compared fairly with previously submitted methods
  • all reporting (data + model cards) is intended to be succinct yet accessible to a non-expert

We hope that this format can prevent misrepresentation of methods, provide archival value through a minimal but necessary level of reproducibility, and support informed decisions about which method to use for a specific task.

References

[1] Luecken et al., 2021
[2] Mitchell et al., 2019
[3] Gebru et al., 2021


Pilot benchmark dataset

We identified the 10Xv3 Mouse M1 single-nucleus data generated by the Allen Institute as a candidate benchmark dataset. The relevant links are below:

  • Updated taxonomy: Links to dendrograms and hierarchy
  • BDS google drive: Count data here is expected to match the version deposited to NeMO. Metadata incorporates updates to the taxonomy (compared to the NeMO version)
  • NeMO archive: Files for the benchmark are under Analysis->BICCN_MOp_snRNA_10X_v3_Analysis_AIBS
  • Cell type explorer link: Summary of the different Mouse M1 datasets in the lower-left panel.

Environment

conda create -n bmark
conda activate bmark
conda install python==3.8
conda install seaborn scikit-learn statsmodels numba pytables
conda install -c conda-forge python-igraph leidenalg
pip install scanpy
pip install gdown timebudget autopep8 toml  # the PyPI "sklearn" package is deprecated; scikit-learn is already installed via conda above
pip install jupyterlab
pip install -e .

Pilot dataset

# Download
source scripts/download_scripts.sh
get_bmark_pilot /allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot/

# Process raw data with the scripts in ./scripts
python -m make_pilot_h5ad --data_path /allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot --min_sample_thr 20 --write_h5ad 1
python -m make_pilot_markers --data_path /allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot --write_csv 1
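The `--min_sample_thr 20` flag suggests that labels with too few cells are excluded before the h5ad is written. A minimal pure-Python sketch of that kind of filtering (function and variable names are hypothetical, not the actual implementation in `make_pilot_h5ad`):

```python
# Hypothetical sketch of min-sample thresholding: keep only samples whose
# label occurs at least min_sample_thr times in the dataset.
from collections import Counter

def filter_rare_labels(labels, min_sample_thr=20):
    """Return indices of samples whose label meets the count threshold."""
    counts = Counter(labels)
    return [i for i, lab in enumerate(labels) if counts[lab] >= min_sample_thr]

# Illustrative labels: "B" occurs only 5 times and is dropped.
labels = ["A"] * 25 + ["B"] * 5 + ["C"] * 30
keep = filter_rare_labels(labels, min_sample_thr=20)
print(len(keep))  # 55 samples retained
```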

Config

Create a config.toml file at the repository root with the appropriate data_dir path:

[pilot]
data_dir = '/allen/programs/celltypes/workgroups/mousecelltypes/benchmarking/dat/pilot/'

  • config.toml is accessed through load_config in bmark.utils.config.
  • Use config.toml for any other hardcoded paths needed for the notebooks/ scripts to work correctly.
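A minimal sketch of how such a file can be parsed with the `toml` package installed in the environment above (the actual loader is `load_config` in `bmark.utils.config`; this is not its implementation, and the path below is illustrative):

```python
import toml

# Parse a TOML string shaped like the repo-root config.toml.
text = "[pilot]\ndata_dir = '/tmp/pilot/'\n"
cfg = toml.loads(text)
print(cfg["pilot"]["data_dir"])  # /tmp/pilot/
```

In practice the same dict would come from `toml.load(open("config.toml"))`, keyed by section (`pilot`) and then by field (`data_dir`).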

Contributors

Rohan Gala, Nelson Johansen, Raymond Sanchez, Kyle Travaglini