/Yada

An Ensemble Based Deconvolution Algorithm


Yada Deconvolution Package.

Yada is a Python library for biological cell-type deconvolution. Given gene expression data, a deconvolution algorithm estimates the cell type proportions in a mixture of cells. Yada can deconvolve using either a list of marker genes or a complete pure gene expression matrix. Yada offers the following advantages:

  • Performance: On benchmark datasets, Yada reached top results in a recent DREAM challenge.
  • Flexibility: Can be used with a pure gene expression matrix or with only a list of marker genes. Its core algorithm supports different sequencing platforms.
  • Speed: Yada is very fast compared to other methods due to its parallel ensemble design.
  • Language: Yada is one of the few deconvolution tools written in Python, the lingua franca of data scientists today.
  • Availability: Can be run either as a Jupyter notebook or standalone script.
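To make the idea of reference-based deconvolution concrete, here is a minimal sketch in plain Python. It is not Yada's ensemble algorithm: it assumes the common linear model in which a mixture is a non-negative weighted sum of pure cell-type profiles, and solves it with projected gradient descent. The function name `deconvolve` and the toy numbers are hypothetical and chosen only for illustration.

```python
def deconvolve(pure, mixture, iterations=2000):
    """Estimate proportions p >= 0 that minimize ||pure @ p - mixture||^2.

    pure:    list of gene rows, each holding one value per cell type.
    mixture: one expression value per gene, in the same gene order.
    """
    n_types = len(pure[0])
    # The sum of squared entries bounds the largest eigenvalue of
    # pure^T pure, so its inverse is a safe gradient step size.
    step = 1.0 / sum(v * v for row in pure for v in row)
    p = [0.0] * n_types
    for _ in range(iterations):
        residual = [sum(s * q for s, q in zip(row, p)) - m
                    for row, m in zip(pure, mixture)]
        gradient = [sum(row[k] * r for row, r in zip(pure, residual))
                    for k in range(n_types)]
        # Gradient step, then projection onto the constraint p >= 0.
        p = [max(0.0, q - step * g) for q, g in zip(p, gradient)]
    total = sum(p)
    # Rescale so the estimates can be read as proportions.
    return [q / total for q in p] if total > 0 else p


# Toy signature matrix (hypothetical): 4 genes x 3 cell types.
PURE = [
    [10, 1, 1],
    [1, 10, 1],
    [1, 1, 10],
    [5, 5, 5],
]
# Mixture built from true proportions 0.5 / 0.3 / 0.2.
MIXTURE = [5.5, 3.7, 2.8, 5.0]

proportions = deconvolve(PURE, MIXTURE)
print([round(x, 3) for x in proportions])
```

On noise-free data like this the estimate recovers the proportions used to build the mixture. Real expression data is noisy, which is one reason methods differ in gene selection and in how they combine several estimates.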

Resources.

Pipeline.

Yada Flow

Sample Datasets.

  • Benchmark data sets are available in the data folder (TIMER, PertU, DSA, DeconRNASeq, Abbas, BreastBlood, RatBrain, EPIC, CIBERSORT).

Requirements on Input Datasets.

  • Two files:
    • pure.csv: pure cell gene expression file, (n genes) x (k cell types).
    • mix.csv: mixture gene expression file, (n genes) x (m mixtures).
    • Gene symbols in column 1; Mixture labels in row 1.
    • Tabular format with no missing entries.
    • It is OK if some genes are missing from either file.
    • Data is assumed to be in non-log (linear) scale. If the dataset's maximum expression value is less than 50, the data is treated as log-transformed and an anti-log is applied to all expression values.
  • Yada runs a marker gene selection algorithm and therefore typically does not use all genes in the signature matrix. If this step is not needed, comment out the relevant lines in the code.
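The input requirements above can be sketched with the Python standard library. This is an illustration, not Yada's own loader: the function names are hypothetical, the anti-log base is assumed to be 2 (the text does not state it), the scale check is assumed to run per file, and the analysis is assumed to be restricted to genes present in both files. Inline CSV strings stand in for pure.csv and mix.csv.

```python
import csv
import io


def read_expression(handle):
    """Parse a CSV with gene symbols in column 1 and labels in row 1."""
    rows = csv.reader(handle)
    labels = next(rows)[1:]
    table = {row[0]: [float(v) for v in row[1:]] for row in rows if row}
    return labels, table


def to_linear_scale(table, threshold=50.0, base=2.0):
    """Anti-log all values if the dataset maximum is below the threshold.

    base=2.0 is an assumption; the README does not name the log base.
    """
    peak = max(max(values) for values in table.values())
    if peak >= threshold:
        return table
    return {gene: [base ** v for v in values] for gene, values in table.items()}


def align_genes(pure, mix):
    """Keep only the genes found in both tables, in a shared order."""
    shared = sorted(set(pure) & set(mix))
    return shared, [pure[g] for g in shared], [mix[g] for g in shared]


# Stand-ins for pure.csv and mix.csv (hypothetical values).
PURE_CSV = "gene,T,B\nCD3E,8.0,1.0\nCD19,1.0,9.0\nGAPDH,10.0,10.0\n"
MIX_CSV = "gene,mix1\nCD3E,300.0\nCD19,120.0\nACTB,500.0\n"

cell_types, pure = read_expression(io.StringIO(PURE_CSV))
mixtures, mix = read_expression(io.StringIO(MIX_CSV))
pure = to_linear_scale(pure)  # maximum is 10 < 50, so values are anti-logged
mix = to_linear_scale(mix)    # maximum is 500 >= 50, so values are unchanged
genes, pure_rows, mix_rows = align_genes(pure, mix)
print(genes, pure_rows, mix_rows)
```

Genes that appear in only one file (GAPDH and ACTB here) are dropped, which matches the statement that missing genes are acceptable.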

Running Using Jupyter Notebook on Google Colab.

Literature Overview.


Table: Summary of methods for cell-type deconvolution of bulk transcriptome
| name | number of sources | data | type | method | doi | author | year | proportions.in | profiles.in | application | availability | out.profiles | out.proportions | comments | category | language | citations | pop.index | published | previously.covered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIBERSORT | 22 | MA | supervised | Support vector regression | https://doi.org/10.1038/nmeth.3337 | Aaron M Newman | 2015 | FALSE | FALSE | Cancer transcriptome | http://cibersort.stanford.edu/ | FALSE | TRUE | | regression | R, web tool | 343 | 85.75 | journal | TRUE |
| ESTIMATE | 2 | MA + RNA-seq | supervised | ssGSEA | https://doi.org/10.1038/ncomms3612 | Kosuke Yoshihara | 2013 | FALSE | FALSE | Cancer transcriptome | https://sourceforge.net/projects/estimateproject/ | FALSE | TRUE | purity estimation | enrichment | R | 266 | 44.3333333333333 | journal | TRUE |
| csSAM | user-defined | MA | supervised | least-squares regression | https://doi.org/10.1038/nmeth.1439 | Shai S Shen-Orr | 2010 | TRUE | FALSE | Blood | https://github.com/shenorrLab/csSAM | TRUE | FALSE | computes DEG | regression | R | 286 | 31.7777777777778 | journal | TRUE |
| Virtual Microdissection | 14 | MA | unsupervised | multiplicative update NMF | https://doi.org/10.1038/ng.3398 | Richard A Moffitt | 2015 | TRUE | FALSE | detection of cancer and stroma in PDAC (TCGA) | NA | TRUE | TRUE | | matrix factorisation | matlab | 86 | 21.5 | journal | TRUE |
| Abbas regression | 17 | MA | supervised | linear least squares regression | https://doi.org/10.1371/journal.pone.0006098 | Alexander R. Abbas | 2009 | FALSE | TRUE | Blood | NA | FALSE | TRUE | | regression | R | 207 | 20.7 | journal | TRUE |
| MCPcounter | 10 | MA | supervised | means of marker genes | https://doi.org/10.1186/s13059-016-1070-5 | Etienne Becht | 2016 | FALSE | FALSE | Cancer transcriptome | https://github.com/ebecht/MCPcounter | FALSE | TRUE | | enrichment | R | 42 | 14 | journal | TRUE |
| PSEA | user-defined | MA | supervised | linear regression | https://doi.org/10.1038/nmeth.1710 | Alexandre Kuhn | 2011 | FALSE | TRUE | Brain tissue | https://bioconductor.org/packages/release/bioc/html/PSEA.html | FALSE | TRUE | | regression | R | 96 | 12 | journal | TRUE |
| Quadratic programming | 6 | MA | supervised | linear latent variable model solved with quadratic programming | https://doi.org/10.1371/journal.pone.0027156 | Ting Gong | 2011 | TRUE | TRUE | Blood | NA | FALSE | TRUE | | regression | unknown | 76 | 9.5 | journal | TRUE |
| Semi-supervised Nonnegative Matrix Factorization | user-defined | MA | semi-supervised | NMF minimizing the Kullback-Leibler divergence on pre-selected genes and with pure proportions | https://doi.org/10.1016/j.meegid.2011.08.014 | Renaud Gaujoux | 2012 | TRUE | TRUE | Blood | https://web.cbio.uct.ac.za/~renaud/CRAN/web/CellMix/ | TRUE | TRUE | | matrix factorisation | R | 61 | 8.71428571428571 | journal | TRUE |
| ssGSEA applied to renal cell carcinoma | 30 | RNA-seq | supervised | ssGSEA | https://doi.org/10.1186/s13059-016-1092-z | Yasin Şenbabaoğlu | 2016 | FALSE | FALSE | Cancer transcriptome | NA | FALSE | TRUE | used Bindea et al. signatures, validated with FACS, CNV and methylome | enrichment | R | 40 | 13.3333333333333 | journal | TRUE |
| DeconRNASeq | undefined | RNA-seq | supervised | linear latent variable model solved with quadratic programming | https://doi.org/10.1093/bioinformatics/btt090 | Ting Gong | 2013 | TRUE | TRUE | Tissue mixtures | https://www.bioconductor.org/packages/release/bioc/html/DeconRNASeq.html | FALSE | TRUE | | regression | R | 52 | 8.66666666666667 | journal | TRUE |
| DSA | user-defined | MA | supervised | linear model solved with quadratic programming | https://dx.doi.org/10.1186%2F1471-2105-14-89 | Yi Zhong | 2013 | FALSE | TRUE* | Cancer transcriptome | https://github.com/zhandong/DSA | TRUE | TRUE | | regression | R | 52 | 8.66666666666667 | journal | TRUE |
| DECONVOLUTE | user-defined | MA | supervised | system of linear equations | https://doi.org/10.1073/pnas.1832361100 | Peng Lu | 2003 | TRUE | TRUE | yeast cell cycle | broken link | FALSE | TRUE | | regression | Java 2 | 135 | 8.4375 | journal | TRUE |
| ISOpure | 2 | MA | supervised | maximum a posteriori (MAP) estimation of multinomial distribution | https://doi.org/10.1186/gm433 | Gerald Quon | 2013 | FALSE | TRUE | Cancer transcriptome | https://qlab.faculty.ucdavis.edu/isopure/ | TRUE | TRUE | purity estimation | probabilistic | matlab, R | 44 | 7.33333333333333 | journal | TRUE |
| xCell | 64 | MA + RNA-seq | supervised | ssGSEA + spillover compensation | https://dx.doi.org/10.1186%2Fs13059-017-1349-1 | Dvir Aran | 2017 | FALSE | FALSE | Cancer transcriptome | http://xcell.ucsf.edu/; https://github.com/dviraran/xCell | FALSE | TRUE | deep deconvolution | enrichment | R, web tool | 15 | 7.5 | journal | FALSE |
| CellCODE | user-defined | MA | semi-supervised | robust latent variable | https://doi.org/10.1093/bioinformatics/btv015 | Maria Chikina | 2015 | TRUE | TRUE* | Blood | http://www.pitt.edu/~mchikina/CellCODE/ | TRUE | TRUE | goal: improve DEG, plugs marker-gene-based covariates into the SVD | matrix factorisation | R, C, C++, Fortran | 28 | 7 | journal | TRUE |
| DCQ | undefined | RNA-seq | supervised | regularized regression model | https://dx.doi.org/10.1002%2Fmsb.134947 | Zeev Altboum | 2014 | TRUE | TRUE | Mice blood under flu infection | http://www.dcq.tau.ac.il/ | FALSE | TRUE | comparing conditions (normal vs infection) | regression | web tool | 32 | 6.4 | journal | TRUE |
| DeMix | 2 | MA | supervised | Maximum likelihood estimate | https://doi.org/10.1093/bioinformatics/btt301 | Jaeil Ahn | 2013 | FALSE | TRUE | Cancer purity | http://odin.mdacc.tmc.edu/∼wwang7/DeMix.html. | TRUE | TRUE | on log transformed data | probabilistic | C, R | 38 | 6.33333333333333 | journal | TRUE |
| Statistical expression deconvolution | 2 | MA | supervised | linear equations | https://doi.org/10.1093/bioinformatics/btq097 | Jennifer Clarke | 2010 | FALSE | TRUE | Cancer xenografts | NA | FALSE | TRUE | | regression | unknown | 53 | 5.88888888888889 | journal | TRUE |
| DSection | user-defined | MA | supervised | Bayesian MCMC-based model | https://doi.org/10.1093/bioinformatics/btq406 | Timo Erkkilä | 2010 | TRUE | FALSE | Tissue mixtures | http://informatics.systemsbiology.net/DSection | TRUE | FALSE | | probabilistic | matlab | 52 | 5.77777777777778 | journal | TRUE |
| Nanodissection | undefined | MA | supervised | iterative SVM | https://doi.org/10.1101/gr.155697.113 | Wenjun Ju | 2013 | FALSE | TRUE | Chronic kidney disease (Cell lineages) | http://nano.princeton.edu/ | FALSE | TRUE | identifies genes | regression | web tool | 33 | 5.5 | journal | TRUE |
| TIMER | 6 | MA + RNA-seq | supervised | constrained linear regression | https://doi.org/10.1186/s13059-016-1028-7 | Bo Li | 2013 | FALSE | TRUE | Cancer transcriptome | http://cistrome.org/TIMER/ | FALSE | TRUE | made for TCGA, other applications need adaptation | regression | web tool | 33 | 5.5 | journal | TRUE |
| Direct method | SM | MA | unsupervised | non-negative least squares and decorrelation | https://www.ncbi.nlm.nih.gov/pubmed/11473019 | David Venet | 2001 | TRUE | FALSE | cancer and normal tissue | NA | TRUE | TRUE | | matrix factorisation | unknown | 96 | 5.33333333333333 | journal | TRUE |
| SPEC | user-defined | MA | supervised | enrichment score | https://doi.org/10.1186/1471-2105-12-258 | Christopher R Bolen | 2011 | FALSE | TRUE* | Blood | http://clip.med.yale.edu/SPEC/ | FALSE | TRUE | | enrichment | R | 39 | 4.875 | journal | TRUE |
| PERT | 11 | MA | supervised | Non-negative maximum likelihood model with adjustment for perturbations | https://doi.org/10.1371/journal.pcbi.1002838 | Wenlian Qiao | 2012 | TRUE | TRUE | blood | https://github.com/gquon/PERT | TRUE | TRUE | | probabilistic | octave | 33 | 4.71428571428571 | journal | TRUE |
| deconf | user-defined | MA | unsupervised | Least squares non-negative matrix factorization algorithm | https://doi.org/10.1186/1471-2105-11-27 | Dirk Repsilber | 2010 | FALSE | FALSE | Blood | https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-11-27/MediaObjects/12859_2009_3484_MOESM1_ESM.ZIP | TRUE | TRUE | | matrix factorisation | R | 41 | 4.55555555555556 | journal | TRUE |
| CTen | undefined | MA | supervised | GSEA | https://doi.org/10.1186/1471-2164-13-460 | Jason E Shoemaker | 2012 | FALSE | FALSE | Infected lung tissue | http://www.influenza-x.org/~jshoemaker/cten/ | TRUE | FALSE | first enrichment based | enrichment | web tool | 31 | 4.42857142857143 | journal | TRUE |
| Mixture models | 2 | MA | supervised | methods of moments procedures and the expectation–maximization algorithm | https://doi.org/10.1093/bioinformatics/bth139 | Debashis Ghosh | 2004 | FALSE | TRUE | Cancer transcriptome | broken link | TRUE | TRUE | improve DEG | probabilistic | R | 66 | 4.4 | journal | TRUE |
| CAM | 10 | MA | unsupervised | convex analysis of mixtures | https://doi.org/10.1038/srep18909 | Niya Wang | 2016 | FALSE | FALSE | yeast cell cycle | http://mloss.org/software/view/437, | TRUE | TRUE | partial profiles | convex hull | R, java | 12 | 4 | journal | TRUE |
| ISOLATE | undefined | MA | supervised | Latent Dirichlet Allocation (LDA) | https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtp378 | Gerald Quon | 2009 | FALSE | FALSE | Cancer transcriptome | https://qlab.faculty.ucdavis.edu/isolate/ | TRUE | TRUE | designed for detection of site of origin | matrix factorisation | matlab | 37 | 3.7 | journal | TRUE |
| UNDO | 2 | MA | unsupervised | matrix inversion | https://doi.org/10.1093/bioinformatics/btu607 | Niya Wang | 2014 | FALSE | FALSE | Cancer transcriptome | https://www.bioconductor.org/packages/release/bioc/html/UNDO.html | TRUE | TRUE | | matrix factorisation | R | 18 | 3.6 | journal | TRUE |
| In silico microdissection | 3 | MA | unsupervised | Gaussian Mixture Model | https://doi.org/10.1186/1471-2105-6-54 | Harri Lähdesmäki | 2005 | FALSE | FALSE | In vitro tissue mixtures | NA | TRUE | TRUE | | probabilistic | unknown | 45 | 3.21428571428571 | journal | TRUE |
| BioQC | 150 | MA + RNA-seq | supervised | Wilcoxon-Mann-Whitney test | https://doi.org/10.1186/s12864-017-3661-2 | Jitao David Zhang | 2017 | FALSE | FALSE | Gene expression | https://www.bioconductor.org/packages/release/bioc/html/BioQC.html | FALSE | FALSE | heterogeneity detection | enrichment | R | 6 | 3 | journal | TRUE |
| Self-directed Method for Cell-Type Identification | undefined | MA | unsupervised | non-negative least squares, Kullback-Leibler divergence | https://doi.org/10.1371/journal.pcbi.1003189 | Neta S. Zuckerman | 2013 | FALSE | TRUE | Cancer transcriptome | NA | TRUE | TRUE | | matrix factorisation | matlab | 18 | 3 | journal | TRUE |
| Computational expression deconvolution | undefined | MA | supervised | linear equations with simulated annealing | https://doi.org/10.1186/1471-2105-7-328 | Min Wang | 2006 | TRUE | FALSE | Murine mammary gland | NA | FALSE | TRUE | | regression | unknown | 36 | 2.76923076923077 | journal | TRUE |
| Electronic subtraction | 2 | MA | supervised | system of linear equations | https://doi.org/10.1093/bioinformatics/btm508 | Mark M. Gosink | 2007 | FALSE | TRUE* | Infected macrophages | NA | TRUE | TRUE | pays attention to rare cell types | regression | unknown | 30 | 2.5 | journal | TRUE |
| EPIC | 8 | RNA-seq | supervised | weighted constrained least squares optimization | https://dx.doi.org/10.7554%2FeLife.26476 | Julien Racle | 2017 | FALSE | FALSE | Cancer transcriptome | https://github.com/GfellerLab/EPIC | FALSE | TRUE | absolute quantities, requires TPM normalized data | regression | R | 4 | 2 | journal | FALSE |
| MMAD | unknown | MA | BOTH | constrained regression with corrected AIC parameter fit | https://doi.org/10.1093/bioinformatics/btt566 | David A. Liebner | 2013 | FALSE | TRUE | in vitro tissue mixtures | http://sourceforge.net/projects/mmad/ | TRUE | TRUE | | regression | matlab | 11 | 1.83333333333333 | journal | TRUE |
| Immune Quant | user-defined | undefined | supervised | DCQ algorithm adapted to human | https://doi.org/10.1093/bioinformatics/btw535 | Amit Frishberg | 2016 | TRUE | TRUE | Human tissues | http://csgi.tau.ac.il/ImmQuant/ | FALSE | TRUE | | regression | web tool | 5 | 1.66666666666667 | journal | TRUE |
| VoCAL | user-defined | MA, GWAS | supervised | linear regression | https://doi.org/10.1371/journal.pcbi.1004856 | Yael Steuerman | 2016 | TRUE | TRUE | Lung tissue | https://cran.r-project.org/web/packages/ComICS/index.html | FALSE | TRUE | returns association with eQTLs | regression | R | 5 | 1.66666666666667 | journal | TRUE |
| CellMapper | user-defined | MA | semi-supervised | SVD | https://doi.org/10.1186/s13059-016-1062-5 | Bradlee D. Nelms | 2016 | FALSE | TRUE* | Brain tissue | http://bioconductor.org/packages/release/bioc/html/CellMapper.html | TRUE | FALSE | ranks genes, computes p-value of specificity | matrix factorisation | R | 5 | 1.66666666666667 | journal | TRUE |
| Estimation of immune cell content | 8 | scRNA-seq | supervised | CIBERSORT using scRNA-seq basis matrix | https://doi.org/10.1038/s41467-017-02289-3 | Max Schelker | 2017 | FALSE | FALSE | Cancer transcriptome | NA | FALSE | TRUE | | regression | unknown | 3 | 1.5 | journal | FALSE |
| contamDE | 2 | +RNA-seq | supervised | empirical Bayes estimate of the negative binomial dispersion | https://doi.org/10.1093/bioinformatics/btv657 | Qi Shen | 2016 | TRUE | TRUE | Tumor purity | https://github.com/zhanghfd/contamDE/ | TRUE | TRUE | computes differential gene expression statistics | probabilistic | R | 4 | 1.33333333333333 | journal | TRUE |
| GSVA scores | 6 | RNA-seq | supervised | GSVA | https://doi.org/10.1158/1078-0432.CCR-17-3509 | David Tamborero | 2018 | FALSE | FALSE | Cancer transcriptome | NA | FALSE | TRUE | | enrichment | unknown | 1 | 1 | journal | FALSE |
| Enumerateblood | 6 | MA | supervised | multi-response Gaussian model trained | https://doi.org/10.1186/s12864-016-3460-1 | Casey P. Shannon | 2017 | FALSE | FALSE | Blood gene expression | https://github.com/cashoes/enumerateblood | TRUE | TRUE | specific to Affymetrix Gene ST, pre-trained | probabilistic | R | 2 | 1 | journal | TRUE |
| CoD | 207 | RNA-seq | supervised | DCQ algorithm + random forest classifier | https://doi.org/10.1093/bioinformatics/btv498 | Amit Frishberg | 2015 | TRUE | FALSE | Mice diseased tissues | http://www.csgi.tau.ac.il/CoD/ | FALSE | TRUE | distinguish cell types important in a disease | regression | web tool | 4 | 1 | journal | TRUE |
| ImmunoStates | 20 | MA | supervised | regression (previously published) | https://doi.org/10.1101/206466 | Francesco Vallania | 2017 | FALSE | FALSE | Blood, solid tissue, disease | NA | FALSE | TRUE | | regression | R | 1 | 0.5 | bioRxiv | FALSE |
| quanTIseq | 11 | RNA-seq + Images | supervised | constrained least squares regression | https://doi.org/10.1101/223180 | Francesca Finotello | 2017 | FALSE | FALSE | Cancer transcriptome | http://icbi.at/software/quantiseq/doc/index.html | FALSE | TRUE | absolute quantities from images, starting with raw data, returns cell densities | regression | web tool | 1 | 0.5 | bioRxiv | FALSE |
| SMC | user-defined | MA | unsupervised | Bayesian inference with sequential Monte Carlo samplers | https://doi.org/10.1371/journal.pone.0186167 | Oyetunji E. Ogundijo | 2017 | FALSE | FALSE | Tissue mixtures | https://github.com/moyanre/smcgenedeconv | TRUE | TRUE | DEG analysis | probabilistic | matlab | 1 | 0.5 | journal | FALSE |
| Modular discrimination index | 5 | MA + RNA-seq | supervised | correlation-based score | https://doi.org/10.1371/journal.pone.0169271 | Gabriele Pollara | 2017 | FALSE | FALSE | Skin tuberculosis | https://github.com/MJMurray1/MDIScoring | FALSE | TRUE | optimisation of signature genes | enrichment | R | 1 | 0.5 | journal | FALSE |
| Robust Computational Reconstitution | SM | MA | supervised | trimmed least modulus (L1) regression | https://doi.org/10.1186/1471-2105-7-369 | Martin Hoffmann | 2006 | TRUE | TRUE | Synovial tissue (cell types in silico) | NA | FALSE | TRUE | | regression | unknown | 6 | 0.461538461538462 | journal | TRUE |
| Statistical mechanics approach | 2 | undefined | unsupervised | Bayesian model with MCMC sampling assuming Gaussian distribution | https://arxiv.org/abs/1210.7508v1 | Nico Riedel | 2013 | FALSE | FALSE | Undefined | NA | TRUE | TRUE | theoretical solution | probabilistic | unknown | 2 | 0.333333333333333 | arXiv | FALSE |
| MHMM | user-defined | MA | unsupervised | Hidden state Markov model | https://doi.org/10.1089/cmb.2006.13.1749 | Sushmita Roy | 2006 | TRUE | FALSE | Yeast cell cycle | NA | TRUE | TRUE | dealing with missing values and time dependency | probabilistic | unknown | 4 | 0.307692307692308 | journal | TRUE |
| MySort | 22 | MA | supervised | ν-Support Vector Regression | https://doi.org/10.1186/s12859-018-2069-6 | Shu-Hwa Chen | 2018 | FALSE | FALSE | Blood | https://testtoolshed.g2.bx.psu.edu/repository?repository_id=6e9a9ab163e578e0&changeset_revision=e3afe097e80a | FALSE | TRUE | Galaxy platform pluggable tool | regression | R, web tool | 0 | 0 | journal | FALSE |
| ADVOCATE | 2 | RNA-seq | supervised | trained Gaussian-mixture model | https://doi.org/10.1101/288779 | Jing He | 2018 | FALSE | TRUE | Cancer transcriptome | NA | TRUE | TRUE | | probabilistic | R | 0 | 0 | bioRxiv | FALSE |
| DTD | user-defined | scRNA-seq | supervised | optimising loss-function of penalized least-squares regression | https://arxiv.org/abs/1801.08447v1 | Franziska Görtler | 2018 | TRUE | TRUE | Cancer transcriptome | NA | FALSE | TRUE | | regression | unknown | 0 | 0 | bioRxiv | FALSE |
| CellDistinguisher | user-defined | MA + RNA-seq | unsupervised | topic modelling + convex hull / NMF | https://doi.org/10.1371/journal.pone.0193067 | Lee A. Newberg | 2018 | FALSE | FALSE | yeast cell cycle | https:// github.com/GeneralElectric/CcellDdistinguisher | TRUE | TRUE | partial profiles | convex hull | R | 0 | 0 | journal | FALSE |
| dtangle | user-defined | MA + RNA-seq | supervised | linear regression | https://doi.org/10.1101/290262 | Gregory J. Hunt | 2018 | FALSE | TRUE | Blood | https://cran.r-project.org/package=dtangle | FALSE | TRUE | | regression | R | 0 | 0 | bioRxiv | FALSE |
| DeconICA | 100 | MA + RNA-seq | unsupervised | ICA + post identification: enrichment | https://doi.org/10.5281/zenodo.1250069 | Urszula Czerwinska | 2018 | FALSE | FALSE | Cancer transcriptome | https://urszulaczerwinska.github.io/DeconICA/ | TRUE | TRUE | metagene profiles | matrix factorisation | R, matlab | 0 | 0 | NA | FALSE |
| DemixT | 3 | MA + RNA-seq | supervised | Maximum likelihood estimate | https://doi.org/10.1101/146795 | Zeya Wang | 2017 | FALSE | TRUE | Cancer transcriptome | https://github.com/wwylab/DeMixT | TRUE | TRUE | | probabilistic | R | 0 | 0 | bioRxiv | FALSE |
| Post-modified non-negative matrix factorization | user-defined | RNA-seq | unsupervised | NMF using alternating least squares | https://doi.org/10.1002/cem.2929 | Yuan Liu | 2017 | FALSE | FALSE | Cancer transcriptome | NA | TRUE | TRUE | DEG analysis | matrix factorisation | matlab | 0 | 0 | journal | FALSE |
| Infino | user-defined | RNA-seq | supervised | Bayesian inference with a generative model | https://doi.org/10.1101/221671 | Maxim E Zaslavsky | 2017 | FALSE | TRUE | Cancer transcriptome | https://github.com/hammerlab/infino | TRUE | TRUE | specialised in deep deconvolution | probabilistic | Stan | 0 | 0 | bioRxiv | FALSE |
| ImSig | 11 | RNA-seq | supervised | correlation-based score | https://doi.org/10.1101/077487 | Ajit Johnson Nirmal | 2016 | FALSE | FALSE | Cancer transcriptome | NA | FALSE | TRUE | | enrichment | unknown | 0 | 0 | bioRxiv | FALSE |
| TEMT | 2 | RNA-seq | supervised | generative mixture model | https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S5-S11 | Yi Li | 2013 | TRUE | TRUE | in vitro cell mixtures | https://github.com/uci-cbcl/TEMT | FALSE | TRUE | use reads | probabilistic | python | 0 | 0 | journal | TRUE |

R is the most popular implementation language among the published methods (49.2%), followed by Matlab (11.11%); so far only one tool has been published in Python.