/scAge

Profiling epigenetic age in single cells

Primary LanguageJupyter NotebookBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

image

scAge

License badge Lab badge Python badge Twitter badge

scAge is a framework for profiling epigenetic age at single-cell resolution, developed entirely in Python.

This tool, the first available single-cell clock, inherently leverages the relationship between DNA methylation in bulk samples and chronological age to predict epigenetic age in intrinsically sparse and binarized single-cell data.

This computational platform is both highly scalable and deeply flexible. It can be used on any number of cells in parallel, and may be trained on any methylation-age dataset.

To learn more about scAge, please consult the Trapp et al. paper published in Nature Aging:

Profiling epigenetic age in single cells
Alexandre Trapp, Csaba Kerepesi & Vadim N. Gladyshev
Nature Aging 2021

Remarkably, a modification of the scAge framework can also be applied to shallow methylation sequencing data, allowing for dramatic cost-effectiveness in regard to sequencing. To learn more about this application to low-pass data, please consult the other Trapp et al. preprint on bioRxiv, as well as Patrick Griffin's excellent TIME-Seq preprint:

Cost-effective epigenetic age profiling in shallow methylation sequencing data
Alexandre Trapp & Vadim N. Gladyshev
bioRxiv 2021.10.25.465778
Ultra-cheap and scalable epigenetic age predictions with TIME-Seq
Patrick Griffin, Alice Kane, Alexandre Trapp, Jien Li, Maeve McNamara, Margarita Meer,
Michael MacArthur, Sarah Mitchell, Amber Mueller, Colleen Carmody, Daniel Vera,
Csaba Kerepesi, Nicole Noren Hooten, James Mitchell, Michele Evans, Vadim N. Gladyshev & David A. Sinclair
bioRxiv 2021.10.25.465725

To learn more about the rejuvenation event discussed in the manuscript, please consult the Kerepesi et al. paper published in Science Advances, which analyzes embryonic epigenetic dynamics at the bulk level:

Epigenetic clocks reveal a rejuvenation event during embryogenesis followed by aging
Csaba Kerepesi, Bohan Zhang, Sang-Goo Lee, Alexandre Trapp & Vadim N. Gladyshev
Science Advances 7(26), 10.1126/sciadv.abg6082

Installation

A pip package version of the framework will be released shortly.

For now, to install scAge and associated data, please clone the GitHub repository:

git clone https://github.com/alex-trapp/scAge.git

This will download all required data to utilize and test the software.

Warning: downloading the full repository will take up about ~620 MB of space on your disk.

For ease of use, all functions needed to run the full scAge pipeline are directly included within the scAge.py script.

Usage

To run scAge, first add the directory containing scAge.py to your path, then import scAge into a Python script or Jupyter notebook:

import sys
sys.path.append('/path/to/dir') # where dir is a directory containing scAge.py
import scAge

In order to use the functions provided in scAge, the following packages need to be installed:

numpy (tested with version 1.20.2)
pandas (tested with version 1.2.4)
scipy (tested with version 1.6.3)
sklearn (tested with version 0.24.2)
tqdm (tested with version 4.60.0)

This tool was developed in Python 3.9.2 in an Anaconda environment, running on Ubuntu 16.04.6 LTS.

To visualize epigenetic age predictions, installation of seaborn, matplotlib and statannot is also recommended.

Predicted epigenetic age output dataframes are conveniently written to .tsv files, and can therefore also be analyzed in any other environment (R, Excel, etc...).

Example notebooks

A set of 3 example Jupyter notebooks detailing how to use the scAge tool are provided in the notebooks directory:

  • process_coverage_notebook.ipynb --> processing .cov/.cov.gz Bismark-generated files into filtered binary methylation matrices
  • construct_reference_notebook.ipynb --> constructing a reference set of linear models from a bulk methylation-age matrix
  • run_scAge_notebook.ipynb --> predicting epigenetic age in single cells, using binary single-cell matrices and a precomputed reference set

These notebooks use all the murine single cells from the Gravina et al. 2016 study, described in depth in Figure 2 of our manuscript. This includes 11 hepatocytes from young livers (4-month-old), 10 hepatocytes from old livers (26-month-old), and 5 mouse embryonic fibroblasts (MEFs). Of note, two cells (one young fibroblast and one old fibroblast) show abberant clustering profiles using dimensionality reduction approaches (see Figure 2c of their paper). These cells also have high predicted epigenetic ages. We suppose these cells may be senescent, but these results may also simply be due to technical artifacts.

All the required data to run these example scripts is provided in this repository:

  • Single-cell methylation coverage files (.cov.gz) are located in sc_data_raw
  • Processed binary methylation matrices (produced by running process_coverage on raw .cov.gz files) are located in sc_data_processed
  • Bulk data for C56BL/6J liver samples used to construct sample reference models is located in bulk
  • Processed reference matrices (produced by running construct_reference on bulk DNAm-age data) for liver, blood, muscle, and multi-tissue datasets are located in train

The functions driving scAge are documented in detail below:

Data

All data from this study is publicly available from the GEO database or the SRA.

The table below summarize where the processed methylation data can be obtained:

Study GEO Accession
Angermueller et al. 2016 GSE68642
Hernando-Herraez et al. 2019 GSE121436
Smallwood et al. 2014 GSE56879
Argelaguet et al. 2019 GSE121690

In the case of the Gravina et al. 2016 study, no deposition of processed methylation data was made to GEO. I downloaded trimmed FASTQ files from the SRA (accession SRA344045) using the fasterq-dump function from the SRA Toolkit and mapped them to the mm10/GRCm38.p6 genome using Bismark v0.22.3 with the option –non_directional, as suggested by the Bismark User Guide v0.21.0 for Zymo Pico-Methyl scWGBS library preparations. Reads were subsequently deduplicated and methylation levels for CpG sites were extracted with Bismark.

Speed, Memory Use and Parallelization

The three functions described below, construct_reference, process_coverage and run_scAge are fully functional running on a single core (the default is n_cores = 1). However, they all experience linear speedup when multiprocessing is applied. Depending on your hardware specifications, it is recommended to increase the number of cores (n_cores) to improve the speed and efficiency of the pipeline. In most of my testing, I used n_cores = 30.

process_coverage and run_scAge can consume a lot of RAM depending on how many cells are being processed simulatenously, so it is recommended to observe processes and memory allocation (i.e. via htop) when running the algorithms for the first time with new parameters/cells.

Speed and memory usage benchmarks for different datasets and parameters will be added shortly.

Training

CpG-specific linear models are first calculated using a reference bulk methylation matrix, which may contain some missing values. From our findings so far, using single tissue datasets is preferred to improve profiling accuracy, although multi-tissue datasets may also be used for training with reasonable predictions.

I provide pre-computed liver, blood, and multi-tissue training reference datasets for C57BL/6J mice inside of the train directory.

In order to train a custom set of linear models from a given DNAm matrix, run:

scAge.construct_reference(training_DNAm_matrix,
                          output_path,
                          n_cores = 1,
                          chunksize = 100)

where:

  • training_DNAm_matrix --> input bulk methylation matrix
  • output_path --> desired full path to the output reference matrix (/path/to/file.tsv)
  • n_cores --> number of cores to use for parallel processing (more cores -> faster)
  • chunksize --> number of individual CpG methylation series to distribute at once to each worker/core

This function takes as input a pandas DataFrame DNAm matrix, with rows as samples and columns as CpGs (in the form chr9_85324737).
Methylation values must be in the range from 0 (fully unmethylated) to 1 (fully methylated).
This dataframe must also contain a numeric "Age" column, which is used to compute correlations and linear regression models.
The input file format is shown below:

Sample chr9_85324737 chr15_85673573 ... Age
GSM3394403 0.328125 0.717949 ... 2
GSM3394378 0.512195 0.439024 ... 10
GSM3394354 0.767857 0.333333 ... 20

An example compressed bulk matrix of C57BL/6J mice livers (in hdf format) is provided in the bulk directory

The output of this function is a .tsv file with the following format:

ChrPos PearsonR PearsonP Coef Intercept
chr9_85324737 0.914574 4.004872e-12 0.021272 0.292949
chr15_85673573 -0.924491 8.035884e-13 -0.023755 0.671210
chr1_3037820 0.066107 0.733323 0.000766 0.723581

Loading single-cell methylomes

scAge requires binary methylation matrices as input for the core epigenetic age profiling algorithm. These binary matrices can be obtained by processing existing .cov files produced by Bismark with our function process_coverage. However, there are other tools to process FASTQ bisulfite sequencing data, such as BSSeeker and methylpy. Therefore, final single-cell methylome files may have slightly different formats depending on the tool used. The Bismark .cov file format is the following (the columns are not named in the first row of the file, they start directly with data):

Chromosome Position 1 Position 2 Methylation level Methylated counts Unmethylated counts
11 3100225 3100225 100 1 0
11 3101286 3101286 0 0 2

Native support for a variety of input formats will be added shortly. For now, I recommend processing single-cell methylation data with Bismark or converting existing methylation data to a Bismark-like format (shown above).

Most of the publicly available data used in our study were available in the .cov format, including data from Hernando-Herraez et al, Angermueller et al. and Smallwood et al.. The sequencing data from the Gravina et al. study was also processed via the Bismark pipeline. Supplementary methylation data from Argelaguet et al. contained the same information as .cov files, but the columns were labeled and in a different order. These files were handled and modified separately to ensure the required format for scAge.

Ultimately, scAge requires a .tsv or .tsv.gz file, or a pandas DataFrame formatted with two columns, as shown below.

ChrPos should be in the form 'chr_position' (i.e. 'chr1_3037802'), while MetLev should be binary (0 or 1).

ChrPos MetLev
chr1_3037802 1
chr19_61305429 0
... ...

To load and process .cov files generated by Bismark, run:

scAge.process_coverage(cov_directory,
                       output_path = "./sc_data_processed/"
                       n_cores = 1,
                       max_met = 100,
                       split = ".",
                       chunksize = 1,
                       binarization = "round")

where:

  • cov_directory --> path to the directory where .cov/.cov.gz single-cell methylation files are stored
  • output_path --> the full path to the output directory in which processed .tsv binary matrices should be written to. If output_path is set to None, named binary methylation matrices are returned in the form of a dictionary of pandas DataFrames
  • n_cores --> number of cores to use for parallel processing
  • maxmet --> maximum methylation value (normally, methylation ratios of Bismark-generated files range from 0 to 100)
  • split --> desired string to split the file name on for single-cell name generation (if split is ".", then "SRR3136624.cov" becomes "SRR3136624")
  • chunksize --> number of coverage files that will be fed into one worker process at a time
  • binarization --> choice of round vs. discard. Both methods involve dropping methylation values of 0.5: round rounds remaining non-binary values to 0 or 1 (this is the default), while discard discards remaining non-binary values. Most values are already binary in single-cell data.

Predicting epigenetic age

The core of scAge is run_scAge, a function that enables epigenetic age predictions from a processed set of binarized single-cell methylome profiles and a reference regression matrix generated from bulk data (ideally from the same tissue). To run this function:

scAge.run_scAge(single_cell_dir_or_dict,
                single_cell_set_name,
                reference_data,
                output_path = "./predictions/",
                selection_mode = "percentile",
                CpG_parameter = 1,
                zero_met_replacement = 0.001,
                one_met_replacement = 0.999,
                min_age = -20,
                max_age = 60,
                age_step = 0.1,
                n_cores = 1,
                uncertainty = 1,
                chunksize = 1)

where:

  • single_cell_dir_or_dict --> full path to a directory OR a dictionary of processed single-cell methylomes, created with process_coverage
  • single_cell_set_name --> desired name of the scAge run (i.e. 'dataset_x')
  • reference_data --> full path to the reference regression matrix created with create_reference
  • output_path --> full path to the directory where predictions and the report file should be written to
  • selection_mode --> one of (percentile (default), numCpGs, cutoff), which determines which CpG selection mode should be used in the algorithm. I recommend percentile mode, which selects the most highly-correlated CpGs while accounting for differential coverage between cells. Alternatively, a defined number of CpGs can be chosen using numCpGs mode, or a cutoff based on Pearson correlation can be set with cutoff (where only CpGs with |r| ≥ threshold are selected for the algorithm).
  • CpG_parameter --> parameter accompanying selection mode
    (1 in percentile mode --> top 1% age-correlated CpGs, default)
    (500 in numCpGs mode --> top 500 age-correlated CpGs)
    (0.7 in cutoff mode --> top age-correlated CpGs with |r| ≥ 0.7)
  • zero_met_replacement1 --> if the linear model built on bulk data goes below 0 for a given age, this value replaces the probability
  • one_met_replacement --> if the linear model built on bulk data goes above 1 for a given age, this value replaces the probability
  • min_age --> minimum age for probability computations
  • max_age --> maximum age for probability computations
  • age_step --> step for probability computations (i.e. if age_step == 1, likelihoods will be calculated for every 1 month between min_age and max_age)
  • n_cores --> number of cores that should be used for parallel processing
  • uncertainty --> the uncertainty metric that should be used to compute upper and lower bounds (higher value result in a wider interval)
  • chunksize --> number of single-cell methylomes that should be passed to each parallel worker at once

The output of run_scAge is a .tsv matrix, which contains a number of columns detailing the internal computations of the algorithm. The leftmost columns are the most critical, including:

  • PredictedAge --> scDNAm epigenetic age predictions in the units of the training data (in our case, months)
  • MeanMet --> mean global methylation of the single cell (which we show is related to epigenetic age in some cell types)
  • CellCoverage --> number of CpGs covered in the single cell (which can be used to filter predictions based on low coverage)

An example output file is shown below:

Cell PredictedAge MeanMet CellCoverage ...
SRR3136627 0.5 0.663169 3914949 ...
SRR3136659 4.0 0.683454 799350 ...
SRR3136628 25.0 0.695256 2511084 ...

Troubleshooting

If you encounter any issues when trying to run scAge, or if you have suggestions, please feel free to contact me by email: alexandre.trapp1@gmail.com. I'm always looking to improve the tool.

Information and acknowledgments

This software was developed by Alexandre (Alex) Trapp, Technical Research Assistant in the Gladyshev Lab at Harvard Medical School and Brigham and Women's Hospital. I want to acknowledge all the members of the Gladyshev Lab for their input, particularly Csaba Kerepesi and Vadim Gladyshev for their contributions to the project, as well as Tiamat Fox and Adit Ganguly for their help with schematic design.

A provisional patent application was filed by Brigham and Women's Hospital on 3/12/2021 which names Alexandre Trapp, Csaba Kerepesi, and Vadim Gladyshev as inventors of this work.