This project is intended to generate high quality perturbagen signatures from LINCS L1000 assay. We build a pipeline, in parallel with L1000 group, to process raw fluorescent intensity data into z-scores as perturbagen signatures. Pre-computed datasets covering a majority of LINCS L1000 Phase I and Phase II is available in Downloads and Zenodo.
Our pipeline is different from the L1000 pipeline mostly in the peak deconvolution algorithm. We implement our algorithm in both C++ and CUDA, which can be used with various languages. We give two examples for how to use these functions with C++ natively and how to be called in Wolfram Mathematica.
Also, we have prepared a small batch of real data and relavant code for you to test our pipeline at a very small scale. You may follow the instructions, run the pipeline, and check the results.
LINCS L1000 Phase I (GSE92742) & Phase II (GSE70138) datasets generated by our pipeline are currently available. The datasets cover three levels: Our Level 4 and Level 5 data are equivalent to Level 4 and Level 5 data provided by L1000; the marginal distributions data of peak locations (GSE92742 small molecule treatments only and GSE70138) are similiar to L1000 Level 2 data, except that they are probability distributions instead of precise numbers of peak locations.
Unless you are interested in managing z-score inference and combination, we encourage you to use combined z-scores by bio-replicates (Level 5 data).
Description | Download |
---|---|
Marginal distributions of peak locations | Bayesian_GSE70138_Level2_DPEAK.zip Bayesian_GSE92742_Level2_DPEAK.zip |
Plate control z-scores | Bayesian_GSE70138_Level4_ZSPC_n335465x978.h5 Bayesian_GSE92742_Level4_ZSPC_n1093191x978.h5 |
Combined z-scores by bio-replicates | Bayesian_GSE70138_Level5_COMPZ_n116218x978.h5 Bayesian_GSE92742_Level5_COMPZ_n361481x978.h5 |
Checksum | Bayesian_L1000_sha512sum.txt |
The meta data are available from the publication by L1000 group: GSE70138 and GSE92742. They include perturbagen and cell line information associated with signature and instance IDs in the datasets.
The z-score results (as HDF5) are compatible with those published by L1000 group. Each of them contains three datasets as follows:
-
/colid
are the signature IDs (Level 5) or instance IDs (Level 4); -
/rowid
are the names of landmark genes; -
/data
are the z-scores as a matrix.
Each marginal distribution file contain the information of peak locations on one plate. It contains four datasets as follows:
-
/colid
are the instance IDs; -
/rowid
are the names of landmark genes; -
/peakloc
are the locations of the peaks for calculating likelihood function; -
/data
are encoded log-likelihoods as a rank-3 array of 16-bit unsigned integers. To retrieve the log-likelihoods, the values should be multiplied by a factor of-0.001
. Note that they are not normalized.
Qiu, Yue, et al., 2020, Bioinformatics, 36(9), 2787, https://doi.org/10.1093/bioinformatics/btaa064