Remove effects of truncated side-products from read count data of a DNA-encoded library.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

deldenoiser

Command line tool to remove effects of truncated side-products from read count data of a DNA-encoded library (DEL) screen.

Summary

Sequencing read counts from a DEL screen are used as input. The main output is the list of fitness coefficients for the compounds. For each compound, the fitness coefficient is proportional to the fraction that survives the binding assay. The deldenoiser command line tool carries out the following analysis steps:

  1. Estimate tag imbalance factors from pre-selection read counts. (Only if such data is available.)

  2. Estimate fitness of truncated compounds using post-selection read counts, yields and tag imbalance factors.

  3. Estimate fitness of full-cycle compounds using fitness of truncates.

  4. Estimate clean read counts, i.e. the reads originating from the full-cycle products.

It is assumed that yields of synthesis reactions are known, and the true fitness vector is sparse, i.e. only a small minority of the DEL compounds have significant binding strength.

Note: We use a microfluidics-inspired terminology and refer to the different reactions that are run in parallel in each synthesis cycle as "lanes".

Installation

Option 1: Install into the local Python environment (requires Python 3.6 or higher) from PyPI by running

pip install deldenoiser

Option 2: Install into the local Python environment from GitHub by running

git clone https://github.com/totient-bio/deldenoiser.git
pip install -e ./deldenoiser

Option 3: Build a local Docker image deldenoiser:<commit_hash> by running

git clone https://github.com/totient-bio/deldenoiser.git
cd deldenoiser
make docker_image

Usage

For a complete example, see example/run_deldenoiser_command_line_tool.bash, which reads input files from example/input/ and writes results to example/output/.

Generally, running the command

deldenoiser --design <DEL_design.tsv.gz>  \
            --postselection_readcounts <readcounts_post.tsv.gz>  \
            --output_prefix <prefix> \
            [--dispersion <dispersion>] \
            [--regularization_strength <regularization_strength>] \
            [--yields <yields.tsv.gz>]  \
            [--preselection_readcount <readcounts_pre.tsv.gz>] \
            [--maxiter <maxiter>] \
            [--inner_maxiter <inner_maxiter>] \
            [--tolerance <tol>] \
            [--parallel_processes <processes>] \
            [--minyield <minyield>] \
            [--maxyield <maxyield>] \
            [--F_init <F_init>] \
            [--max_downsteps <max_downsteps>]
            

produces three files:

  • <prefix>_fullcycleproducts.tsv.gz
  • <prefix>_truncates.tsv.gz
  • <prefix>_tag_imbalance_factors.tsv.gz
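
For scripted runs, the same invocation can be assembled in Python. The sketch below is only an illustration: the helper names build_deldenoiser_command and expected_outputs are hypothetical, the file names are placeholders, and the flag names are copied verbatim from the usage line above. Actually executing the command assumes that deldenoiser is installed and on the PATH.

```python
import shlex


def build_deldenoiser_command(design, post_counts, prefix, **optional):
    """Assemble the argument list for one deldenoiser run (hypothetical helper)."""
    cmd = [
        "deldenoiser",
        "--design", str(design),
        "--postselection_readcounts", str(post_counts),
        "--output_prefix", str(prefix),
    ]
    # Optional flags are passed through by name, e.g. dispersion=3.0
    # becomes "--dispersion 3.0". Names must match the usage line.
    for name, value in optional.items():
        cmd += ["--" + name, str(value)]
    return cmd


def expected_outputs(prefix):
    """File names that a run with this prefix is documented to produce."""
    suffixes = ("fullcycleproducts", "truncates", "tag_imbalance_factors")
    return ["{}_{}.tsv.gz".format(prefix, s) for s in suffixes]


cmd = build_deldenoiser_command(
    "DEL_design.tsv.gz", "readcounts_post.tsv.gz", "results/run1",
    dispersion=3.0, maxiter=20,
)
# Print a copy-pasteable shell line; quoting guards against spaces in paths.
print(" ".join(shlex.quote(arg) for arg in cmd))
print(expected_outputs("results/run1"))

# To run it for real (not done here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.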

Inputs

  1. <DEL_design.tsv>, tab-separated values that encode the number of synthesis cycles and the number of lanes in each cycle, with two columns:

    • cycle: cycle index (1,2,... cmax)
    • lanes: number of lanes in the corresponding cycle (must be >= 1)
  2. <readcounts_post.tsv>, tab-separated values that encode the read counts obtained from sequencing done after the DEL selection steps, with cmax + 1 columns:

    • cycle_1_lane: lane index of cycle 1
    • cycle_2_lane: lane index of cycle 2
    • ...
    • cycle_<cmax>_lane: lane index of cycle cmax
    • readcount: number of reads of the DNA tag that identifies the corresponding lane index combination (non-negative integers)
  3. <prefix>, string (that can include the path) to name the output files.
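
As an illustration of the two required file layouts, the following sketch writes a design file and a post-selection read count file for a made-up library with two cycles (2 and 3 lanes). The library shape, the file locations and the placeholder read counts of 0 are assumptions for the example only. Lane indices are taken to start at 1, as in the yields file description. Whether combinations with zero reads need to be listed at all is not stated here, so the sketch simply lists every combination.

```python
import csv
import gzip
import itertools
import os
import tempfile

lanes_per_cycle = [2, 3]  # hypothetical library: cycle 1 has 2 lanes, cycle 2 has 3
cmax = len(lanes_per_cycle)

workdir = tempfile.mkdtemp()
design_path = os.path.join(workdir, "DEL_design.tsv.gz")
counts_path = os.path.join(workdir, "readcounts_post.tsv.gz")

# Design file: one row per cycle, columns "cycle" and "lanes".
with gzip.open(design_path, "wt", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerow(["cycle", "lanes"])
    for cycle, lanes in enumerate(lanes_per_cycle, start=1):
        writer.writerow([cycle, lanes])

# Read count file: cmax lane columns followed by "readcount".
header = ["cycle_{}_lane".format(c) for c in range(1, cmax + 1)] + ["readcount"]
lane_ranges = [range(1, n + 1) for n in lanes_per_cycle]
with gzip.open(counts_path, "wt", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for combo in itertools.product(*lane_ranges):
        writer.writerow(list(combo) + [0])  # placeholder count, not real data

print(design_path)
print(counts_path)
```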

Optional inputs:

  1. <dispersion>, dispersion parameter for the dispersed Poisson noise (optional, default: 3.0)

  2. <regularization_strength>, regularization strength parameter (optional, default: 1.0)

  3. <yields.tsv>, tab-separated values that encode the yields of the reactions during synthesis, with three columns (optional, default: all yields are set to 0.5):

    • cycle: cycle index (1,2,... cmax)
    • lane: lane index (1,2, ... [number of lanes in the corresponding cycle])
    • yield: yield of reaction in the corresponding lane (real number between 0.0 and 1.0)
  4. <readcounts_pre.tsv>, same structure as <readcounts_post.tsv>, but for reads obtained from sequencing done before the DEL selection step (optional, default: sequencing efficiency is assumed to be uniform across all sequences)

  5. <maxiter>: maximum number of coordinate descent iterations during fitting truncates (default = 20)

  6. <inner_maxiter>: maximum number of iterations for each coordinate descent step during fitting truncates (default = 10)

  7. <tol>: tolerance; if the intensity due to truncates changes by less than this between consecutive iterations of coordinate descent, the fitting is stopped before reaching maxiter iterations (default = 0.1)

  8. <processes>: max number of parallel processes to start during fitting truncates (default = number of system CPUs)

  9. <minyield>: lowest allowed input yield value; yields lower than this get censored to this level during preprocessing (default = 1e-10)

  10. <maxyield>: highest allowed input yield value; yields higher than this get censored to this level during preprocessing (default = 0.95)

  11. <F_init>: initial value for truncate fitness (default: internal guess is used)

  12. <max_downsteps>: max number of allowed iterations during which logL is decreasing. If it is reached, the optimization terminates. (default = 5)
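
The effect of minyield and maxyield on the input yields can be pictured as a simple clamp. The sketch below is not taken from the deldenoiser source; clip_yield is a hypothetical helper, the yield values are made up, and only the two default limits come from the option descriptions above.

```python
def clip_yield(value, minyield=1e-10, maxyield=0.95):
    """Clamp one yield into [minyield, maxyield] (illustrative helper)."""
    return min(max(value, minyield), maxyield)


# Made-up yields, keyed by (cycle, lane).
raw_yields = {(1, 1): 0.0, (1, 2): 0.5, (2, 1): 1.0}
clipped = {key: clip_yield(y) for key, y in raw_yields.items()}

for (cycle, lane), y in sorted(clipped.items()):
    print(cycle, lane, y)
```

With the default limits, a yield of 0.0 is raised to 1e-10, a yield of 1.0 is lowered to 0.95, and values in between are left unchanged.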

Outputs

  1. <prefix>_fullcycleproducts.tsv.gz: tab-separated values containing the results about full-cycle products, each identified by their extended lane index combination. The cmax + 3 columns contain

    • cycle_<cid>_lane: lane index of cycle cid = 1,2,... cmax
    • fitness: fitness coefficients
    • clean_reads: posterior mode of clean reads

    Note: Only records corresponding to non-zero input read counts are printed in this file. Compounds with zero observed reads are implicitly assumed to have zero fitness and zero clean reads.
  2. <prefix>_truncates.tsv.gz: tab-separated values encoding the fitness coefficients of the truncates, each identified by their extended lane index combination. The cmax + 1 columns contain

    • cycle_<cid>_lane: extended lane index (which can take 0 as well, as an indication that the synthesis cycle failed) of cycle cid = 0,1,2,... cmax
    • fitness: fitness coefficients of truncated compounds

    Note: Only records corresponding to truncates that are estimated to have non-zero fitness are printed in this file. The truncates missing from this file should be understood to have zero fitness.
  3. <prefix>_tag_imbalance_factors.tsv.gz: tab-separated values containing the estimated tag imbalance factors (bhat) for each cycle and lane. It has 3 columns (the same shape as the optional <yields.tsv[.gz]> input file):

    • cycle: cycle index (1,2,... cmax)
    • lane: lane index (1,2, ... lmax[c])
    • imbalance_factor: imbalance factor of the corresponding cycle and reaction lane
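
Because records with zero reads (or zero fitness, for truncates) are omitted from the output files, downstream code has to treat missing lane combinations as zero. The sketch below shows one way to do that with the standard library. The helper read_fitness is hypothetical, and the demo file with its numbers is fabricated purely so the example runs on its own. Columns are looked up by name, so any additional columns in a real output file are ignored.

```python
import csv
import gzip
import os
import tempfile


def read_fitness(path):
    """Map lane index tuples to fitness values (illustrative helper)."""
    fitness = {}
    with gzip.open(path, "rt", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # Collect cycle_<cid>_lane columns, ordered by cycle index.
            lane_cols = sorted(
                (c for c in row if c.startswith("cycle_") and c.endswith("_lane")),
                key=lambda c: int(c.split("_")[1]),
            )
            key = tuple(int(row[c]) for c in lane_cols)
            fitness[key] = float(row["fitness"])
    return fitness


# Fabricated demo file, standing in for <prefix>_fullcycleproducts.tsv.gz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_fullcycleproducts.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as fh:
    fh.write("cycle_1_lane\tcycle_2_lane\tfitness\tclean_reads\n")
    fh.write("1\t2\t0.25\t7\n")

fitness = read_fitness(demo_path)
# Compounds missing from the file are read as zero fitness.
print(fitness.get((1, 2), 0.0))
print(fitness.get((2, 2), 0.0))
```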

Documentation

  • The publication "Denoising DNA Encoded Library Screens with Sparse Learning" by Peter Komar and Marko Kalinic describes the assumptions behind the statistical model of deldenoiser and reports its performance on synthetic and experimental read count data.

    • Preprint on ChemRxiv
    • Peer-reviewed publication submitted to ACS Combinatorial Science
  • API documentation of the deldenoiser Python package can be built by cloning the repository and running the make docs command from the main directory, which contains the Makefile.

  • Developer's notes can be found at development-notes/deldenoiser-development-notes.pdf