L1000 peak deconvolution based on Bayesian analysis

Overview

This project aims to generate high-quality perturbagen signatures from the LINCS L1000 assay. We built a pipeline, in parallel with that of the L1000 group, that processes raw fluorescent intensity data into z-scores that serve as perturbagen signatures. Pre-computed datasets covering the majority of LINCS L1000 Phase I and Phase II are available in Downloads and on Zenodo.

Our pipeline differs from the L1000 pipeline mainly in the peak deconvolution algorithm. We implement the algorithm in both C++ and CUDA, so it can be called from various languages. We give two examples: using the functions natively in C++, and calling them from Wolfram Mathematica.

We have also prepared a small batch of real data and the relevant code so that you can test our pipeline at a very small scale. You may follow the instructions, run the pipeline, and check the results.

Datasets

Summary

LINCS L1000 Phase I (GSE92742) and Phase II (GSE70138) datasets generated by our pipeline are currently available. The datasets cover three levels: our Level 4 and Level 5 data are equivalent to the Level 4 and Level 5 data provided by L1000; the marginal distributions of peak locations (GSE92742 small molecule treatments only, and GSE70138) are similar to L1000 Level 2 data, except that they are probability distributions rather than single values for the peak locations.

Unless you are interested in managing z-score inference and combination yourself, we encourage you to use the z-scores combined across bio-replicates (Level 5 data).

Downloads

Description	Download
Marginal distributions of peak locations	Bayesian_GSE70138_Level2_DPEAK.zip Bayesian_GSE92742_Level2_DPEAK.zip
Plate control z-scores	Bayesian_GSE70138_Level4_ZSPC_n335465x978.h5 Bayesian_GSE92742_Level4_ZSPC_n1093191x978.h5
Combined z-scores by bio-replicates	Bayesian_GSE70138_Level5_COMPZ_n116218x978.h5 Bayesian_GSE92742_Level5_COMPZ_n361481x978.h5
Checksum	Bayesian_L1000_sha512sum.txt
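The checksum file can be used to confirm that a download is intact. Below is a minimal Python sketch of that check. It assumes Bayesian_L1000_sha512sum.txt follows the usual `sha512sum` layout (one hex digest followed by a file name per line), which is not confirmed here; the function names are illustrative and not part of this project.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_list(text):
    """Parse sha512sum-style lines into {file name: hex digest}.

    Assumption: each line is '<hex digest>  <file name>', where an
    optional '*' before the file name marks binary mode.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_download(data_path, checksum_path):
    """Return True if data_path matches its entry in the checksum list."""
    expected = parse_checksum_list(Path(checksum_path).read_text())
    name = Path(data_path).name
    if name not in expected:
        raise KeyError(f"{name} is not listed in {checksum_path}")
    return sha512_of_file(data_path) == expected[name]
```

For example, `verify_download("Bayesian_GSE70138_Level5_COMPZ_n116218x978.h5", "Bayesian_L1000_sha512sum.txt")` would return True for an intact file, provided the checksum list uses bare file names.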

The metadata are available from the L1000 group's publications: GSE70138 and GSE92742. They include the perturbagen and cell line information associated with the signature and instance IDs in the datasets.

Data structures

The z-score results (stored as HDF5) are compatible with those published by the L1000 group. Each file contains three datasets as follows:

/colid are the signature IDs (Level 5) or instance IDs (Level 4);
/rowid are the names of landmark genes;
/data are the z-scores as a matrix.
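As an illustration of the layout above, the following Python sketch pulls one signature out of a z-score file. It is not part of this project: the function names are hypothetical, `h5py` is a third-party package that is assumed to be installed, and because the orientation of /data is not stated here, the code infers it from the matrix shape.

```python
def _as_text(values):
    """Decode byte strings (as HDF5 libraries often return) to str."""
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def read_zscores(store):
    """Extract the IDs and the z-score matrix from an opened file.

    `store` is anything indexable by dataset name, e.g. an h5py.File.
    """
    col_ids = _as_text(store["colid"][:])
    row_ids = _as_text(store["rowid"][:])
    data = store["data"][:]  # loads the whole matrix into memory
    return col_ids, row_ids, data


def signature(col_ids, row_ids, data, col_id):
    """Return {gene: z-score} for one signature or instance ID.

    Assumption: /data is either (IDs x genes) or (genes x IDs);
    the shape decides which.
    """
    j = col_ids.index(col_id)
    shape = (len(data), len(data[0]))
    if shape == (len(col_ids), len(row_ids)):
        values = data[j]                     # one row per signature
    elif shape == (len(row_ids), len(col_ids)):
        values = [row[j] for row in data]    # one column per signature
    else:
        raise ValueError("matrix shape does not match /colid and /rowid")
    return dict(zip(row_ids, (float(v) for v in values)))


def load_hdf5(path):
    """Read a z-score file with h5py (imported lazily)."""
    import h5py
    with h5py.File(path, "r") as f:
        return read_zscores(f)
```

A call such as `signature(*load_hdf5(path), col_id)` then gives the z-scores of the landmark genes for one ID. The Level 4 files are large, so reading /data in slices may be preferable to loading the full matrix as this sketch does.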

Each marginal distribution file contains the peak location information for one plate. It contains four datasets as follows:

/colid are the instance IDs;
/rowid are the names of landmark genes;
/peakloc are the locations of the peaks for calculating likelihood function;
/data are encoded log-likelihoods as a rank-3 array of 16-bit unsigned integers. To retrieve the log-likelihoods, the values should be multiplied by a factor of -0.001. Note that they are not normalized.
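The decoding step described above can be sketched in Python as follows. The sketch works on a single vector of encoded values taken along the peak-location axis; which axis of the rank-3 array that is has to be checked against /peakloc and is not assumed here. The normalization treats the peak locations as a discrete grid, which is an assumption made for illustration, and the input values are made up.

```python
import math

SCALE = -0.001  # stated above: log-likelihood = encoded value * -0.001


def decode_loglik(encoded):
    """Convert encoded unsigned integers to unnormalized log-likelihoods."""
    return [SCALE * int(v) for v in encoded]


def normalize(loglik):
    """Turn log-likelihoods into probabilities that sum to 1.

    Subtracting the maximum first (log-sum-exp) avoids underflow.
    """
    peak = max(loglik)
    weights = [math.exp(x - peak) for x in loglik]
    total = sum(weights)
    return [w / total for w in weights]


encoded = [0, 693, 65535]  # made-up values, one per peak location
probs = normalize(decode_loglik(encoded))
```

Because the factor is negative, smaller encoded values correspond to higher likelihoods: an encoded 0 is the most likely location in the vector, and 65535 is the least likely value that can be represented.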

Citation

Qiu, Yue, et al., 2020, Bioinformatics, 36(9), 2787, https://doi.org/10.1093/bioinformatics/btaa064

njpipeorgan/L1000-bayesian