CRISPR Bayesian Estimation of variant effect (from Base Editing reporter screens) with guide Activity Normalization
This is an analysis toolkit for the pooled CRISPR reporter or sensor data. The reporter technique transfects cells with plasmid with not only sgRNA but with the target sequence surrogate which we call reporter or sensor.
crispr-bean
supports the following functionalities.
bean-count
,bean-count-sample
: Base-editing-aware mapping of guide, optionally with reporter from.fastq
files.bean-qc
: Quality control report and filtering out / masking of aberrant sample and guidesbean-filter
: Filter reporter allelesbean-run
: Quantify targeted variants' effect sizes from screen data.
First install pytorch. Then download from PyPI:
pip install crispr-bean[model]
pip install crispr-bean
This wouldn't have variant effect size quantification (bean-run
) functionality.
bean-count-samples
or bean-count
maps guide into guide counts, allowing for base transition in spacer sequence. When the matched reporter information is provided, it can count the target site edits and alleles produced by each guide. Mapping is efficiently done based on CRISPResso2 modified for base-edit-aware mapping.
bean-count-samples \
--input sample_list.csv `# sample with lines 'R1_filepath,R2_filepath,sample_name\n'` \
-b A `# base that is being edited (A/G)` \
-f sgRNA_info_table.csv `# sgRNA information` \
-o . `# output directory` \
-r `# read edit/allele information from reporter` \
-t 12 `# number of threads` \
--name my_sorting_screen `# name of this sample run`
File should contain following columns.
name
: gRNA ID columnsequence
: gRNA sequencebarcode
: R2 barcode to help match reporter to gRNAstrand [Optional]
: Specifies gRNA strand information relative to the reference genome.start_pos [Optional]
: gRNA starting position in the genome. Required when you providestrand
column. Should specify the smaller coordinate value among start and end position regardless of gRNA strandedness.offset [Optional]
: Specifies the absolute positional offset to be added to the edited position. Useful when you need amino acid translation results for ex. coding sequence tiling screens.target_pos [Optional]
: If--match_target_pos
flag is used, input file needstarget_pos
which specifies 0-based relative position of targeted base within Reporter sequence.- In order to use accessibility in the variant effect quantification, provide accessibility information in one of two options. (For non-targeting guides, provide NA values (empty cell).)
- Option 1:
chr
&genomic_pos
: Chromosome (ex.chr19
) and genomic position of guide sequence. You will have to provide the path to the bigwig file with matching reference version inbean-run
. - Option 2:
accessibility_signal
: ATAC-seq signal value of the target loci of each guide.
- Option 1:
File should contain following columns with header.
R1_filepath
: Path to read 1.fastq[.gz]
fileR2_filepath
: Path to read 1.fastq[.gz]
filesample_id
: ID of sequencing samplerep [Optional]
: Replicate # of this samplebin [Optional]
: Name of the sorting binupper_quantile [Optional]
: FACS sorting upper quantilelower_quantile [Optional]
: FACS sorting lower quantile
Optional columns are not required but can be provided for compatibility with bean-qc
and bean-run
.
count
or count-samples
produces .h5ad
and .xlsx
file with guide and per-guide allele counts.
.h5ad
: This output file follows annotated matrix format compatible withAnnData
and is based onScreen
object in purturb_tools. The object contains the per-guide allele counts..guides
: guide information provided in input (gRNA_library.csv
in above example).samples
: sample information provided in input (sample_list.csv
in above example).X
: Main guide count matrix, where row corresponds to each guide in.guides
and columns correspond to samples in.samples
. Following attributes are included if matched reporter is provided and you chose to read edit/allele information from the reporter using-r
option..X_bcmatch [Optional]
: Contains information about number of barcode-matched reads. Information about R2 barcode should be specified asbarcode
column in yourgRNA_library.csv
file..X_edits [Optional]
: If target position of each guide is specified astarget_pos
in inputgRNA_library.csv
file and--match-target-position
option is provided, the result has the matrix with the number of target edit at the specified positions..allele_tables [Optional]
: Dictionary with a single allele count table that counts per guide and allele combination, what is the count per sample.
.xlsx
: This output file contains.guides
,.samples
,.X[_bcmatch,_edits]
. (allele_tables
are often too large to write into an Excel!)
bean-qc my_sorting_screen.h5ad -o my_sorting_screen_masked.h5ad -r qc_report_my_sorting_screen
bean-qc
supports following quality control and masks samples with low quality. Specifically:
- Plots guide coverage and the uniformity of coverage
- Guide count correlation between samples
- Log fold change correlation when positive controls are provided
- Plots editing rate distribution
- Identify samples with low guide coverage/guide count correlation/editing rate and mask the sample in
bdata.samples.mask
- Identify outlier guides to filter out
Above command produces
my_sorting_screen_masked.h5ad
without problematic replicate and guides and with sample masks, andqc_report_my_sorting_screen.[html,ipynb]
as QC report.
--replicate-label
(default:"rep"
): Label of column inbdata.samples
that describes replicate ID.--condition-label
(default:"bin"
)": Label of column inbdata.samples
that describes experimental condition. (sorting bin, time, etc.).--target-pos-col
(default:"target_pos"
): Target position column inbdata.guides
specifying target edit position in reporter.--rel-pos-is-reporter
(default:False
): Specifies whetheredit_start_pos
andedit_end_pos
are relative to reporter position. IfFalse
, those are relative to spacer position.--edit-start-pos
(default:2
): Edit start position to quantify editing rate on, 0-based inclusive.--edit-end-pos
(default:7
): Edit end position to quantify editing rate on, 0-based exclusive.--count-correlation-thres
(default:0.8
): Threshold of guide count correlation to mask out.--edit-rate-thres
(default:0.1
): Median editing rate threshold per sample to mask out.--posctrl-col
(default:group
): Column name in .h5ad.guides DataFrame that specifies guide category.--posctrl-val
(default:PosCtrl
): Value in .h5ad.guides[posctrl_col
] that specifies guide will be used as the positive control in calculating log fold change.--lfc-thres
(default:0.1
): Positive guides' correlation threshold to filter out.
bean-run variant[tiling] my_sorting_screen_masked.h5ad --scale-by-acc --acc-bw-path accessibility_signal.bw -o output_prefix/ --fit-negctrl
Above command produces
output_prefix/bean_element_result.[model_type].csv
with following columns:mu
: Mean of variant phenotype, given the wild type has standard normal phenotype distribution ofmu = 0, sd = 1
.mu_sd
: Mean of variant phenotypemu
is modeled as normal distribution. The column shows fitted standard deviation ofmu
that quantify the uncertainty of the variant effect.mu_z
: z-score ofmu
sd
: Standard deviation of variant phenotype, given the wild type has standard normal phenotype distribution ofmu = 0, sd = 1
.fdr_dec
: FDR ofmu
being smaller than 0fdr_inc
: FDR ofmu
being larger than 0fdr
: FDR ofmu
being nonzero (=min(fdr_dec, fdr_inc)
) When negative control is provided, above columns with_adj
suffix are provided, which are the corresponding values adjusted for negative control.
output_prefix/bean_sgRNA_result.[model_type].csv
:edit_rate
: Estimated editing rate at the target loci.
- Run options
-o
,--outdir
: Directory to save the run result.--result-suffix
(default: ""): Suffix of the output files.--uniform-edit
(default:False
): Ignores variable editing--scale-by-acc
(default:False
): Scale guide editing efficiency by the target loci accessibility. Needs one ofacc_bw_path
oracc_col
that provides per-guide accessibility information.--acc-bw-path
: Accessibility .bigWig file to be used to assign accessibility of guides. To read accessibility information from the .bigWig file, input.h5ad
file needschr
andgenomic_pos
column in.guides
DataFrame.
--fit-negctrl
(default:False
): Fit the shared negative control distribution to normalize the fitted parameters.--negctrl-col
(default:target_group
): Column in bdata.obs specifying if a guide is negative control. If thebdata.guides[negctrl_col].tolower() == negctrl_col_value
, it is treated as negative control guide.--negctrl-col-value
(default:negctrl
): Column value in bdata.guides specifying if a guide is negative control. If thebdata.guides[negctrl_col].tolower() == negctrl_col_value
, it is treated as negative control guide.
--repguide-mask
(default:repguide_mask
): n_replicate x n_guide mask to mask the outlier guides. bdata.uns[repguide_mask] will be used. This is calculated withbean-qc
.--cuda
(default:False
): Run on GPU--device
: Optionally use GPU if provided valid GPU device name (ex. cuda:0)
--ignore-bcmatch
(default:False
): If provided, even if the screen object has .X_bcmatch, ignore the count when fitting.--allele-df-key
(default:allele_counts
): bdata.uns[allele_df_key] will be used as the allele count.--control-guide-tag
(default:CONTROL
): (Relevant in bean-run tiling mode) If this string is in guide name, treat each guide separately not to mix the position. Used for non-targeting negative controls.
- Guide annotations (
bdata.guides
column keys)--acc-col
: Column name in bdata.guides that specify raw ATAC-seq signal.--target-column
(default:target
): Column key inbdata.guides
that describes the target element of each guide.--guide-activity-col
: Column inbdata.guides
DataFrame showing the editing rate estimated via external tools.
- Sample annotations (
bdata.samples
column keys)--condition-column
(default:bin
): Column key inbdata.samples
that describes experimental condition.-uq
,--sorting-bin-upper-quantile-column
(default:upper_quantile
): Column name with upper quantile values of each sorting bin in bdata.samples-lq
,--sorting-bin-lower-quantile-column
(default:lower_quantile
): Column name with lower quantile values of each sorting bin in bdata.samples
import crispr_bean as be
cdata = be.read_h5ad("bean_counts_sample.h5ad")
See the ReporterScreen API tutorial for more detail.