FowlerLab/Enrich2

Feature Request: Pseudocounts for drop-out variants

hezscha opened this issue · 2 comments

Hi,
I'm running Enrich2 on a selection MAVE and have noticed that I am unable to get scores for some poorly performing variants because they tend to drop out at later time points during the selection. My PI was wondering if we could alleviate that by introducing pseudocounts, only for those variants that were clearly present in the initial sample and then declined. We have 3-4 time points plus initial samples and are scoring with WLS regression.

Do you know if this is at all done for MAVE data or if not what the objections are?
And is this something you would consider adding to Enrich2?

This question has come up before but, as you noticed, Enrich2 does not support it. I'll try to explain the reasoning behind not calculating these scores and suggest a possible workaround.

> Do you know if this is at all done for MAVE data or if not what the objections are?

If you are using ratio-based scores (one input and one selected time point), adding a pseudocount might perform well.
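To make the ratio-based case concrete, here is a minimal sketch of a generic log-ratio score with a pseudocount. The function name, the default pseudocount of 0.5, and the example counts are assumptions chosen for illustration; this is not necessarily the exact formula Enrich2 uses internally.

```python
import math


def log_ratio_score(c_in, c_sel, wt_in, wt_sel, pseudocount=0.5):
    """Generic log-ratio score of a variant relative to wild type.

    c_in / c_sel:   variant counts before and after selection
    wt_in / wt_sel: wild-type counts before and after selection
    pseudocount:    added to every count so that a zero count
                    still gives a finite log ratio (assumed value)
    """
    in_ratio = (c_in + pseudocount) / (wt_in + pseudocount)
    sel_ratio = (c_sel + pseudocount) / (wt_sel + pseudocount)
    return math.log(sel_ratio / in_ratio)


# Toy counts: the variant is present in the input and absent after selection.
dropout_score = log_ratio_score(c_in=100, c_sel=0, wt_in=1000, wt_sel=1000)
```

With a pseudocount of zero the same call would fail, because the log of zero is undefined. With a nonzero pseudocount the dropout variant gets a finite, strongly negative score whose magnitude depends on the pseudocount chosen.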

The issues come in with regression-based scores with many time points. If a variant drops out early, does it make sense to calculate a strong negative score based on the regression line intercepting the x-axis when the dropout happens? What if the variant drops out in the middle of the experiment and is then seen again in a later time point (due to sampling issues)? The log-linear fit will be very poor and potentially misleading.
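As a rough illustration of the regression problem, the sketch below fits an ordinary (unweighted) least-squares line to log ratios over time. It is deliberately simpler than the WLS fit Enrich2 performs. The counts are invented toy numbers and the function name and pseudocount are assumptions.

```python
import numpy as np


def log_linear_fit(counts, wt_counts, times, pseudocount=0.5):
    """Fit log((c + p) / (wt + p)) against time with ordinary least squares.

    Returns the slope and the root-mean-square residual of the fit.
    """
    counts = np.asarray(counts, dtype=float)
    wt_counts = np.asarray(wt_counts, dtype=float)
    times = np.asarray(times, dtype=float)
    y = np.log((counts + pseudocount) / (wt_counts + pseudocount))
    slope, intercept = np.polyfit(times, y, 1)
    residuals = y - (slope * times + intercept)
    return float(slope), float(np.sqrt(np.mean(residuals ** 2)))


times = [0, 1, 2, 3]
wt = [1000, 1000, 1000, 1000]

# Toy variant that halves at every time point: close to log-linear.
steady_slope, steady_rms = log_linear_fit([200, 100, 50, 25], wt, times)

# Toy variant that drops out and is seen again at the last time point.
reappear_slope, reappear_rms = log_linear_fit([200, 0, 0, 15], wt, times)
```

Both toy variants get a negative slope, but the residual for the reappearing variant is far larger than for the steadily declining one. That is the poor log-linear fit described above.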

We were not able to determine a general solution to these issues, and did not have sufficient test data to approach the problem at the time, so we went ahead and filtered out these variants.

> And is this something you would consider adding to Enrich2?

Enrich2 is no longer under active development, but I have added this feature request to the successor project.

If you would like to add a pseudocount, my suggestion is:

  1. Count the variants using Enrich2 in counts-only mode - do not calculate scores.
  2. Open the HDF5 file in a Jupyter notebook or similar environment and add the pseudocount to all relevant count tables.
  3. Re-run Enrich2 using the same configuration file, but enable score calculation. It will automatically detect that the counts are already present, and use these modified counts to calculate variant scores.
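Step 2 could look roughly like the sketch below. It assumes that each time point has its own HDF5 file readable with pandas, that the variant counts live under the key `/main/variants/counts` in a single `count` column indexed by variant, and that a variant with zero reads has no row at all. Please check these assumptions against `store.keys()` on your own files. The function names, the `min_initial` threshold, and the integer pseudocount are hypothetical choices, and it is safest to work on copies of the files.

```python
import pandas as pd

COUNTS_KEY = "/main/variants/counts"  # assumed key; verify with store.keys()


def pad_with_pseudocount(initial, later, pseudocount=1, min_initial=10):
    """Add a pseudocount to a later time point's count table.

    Variants with at least `min_initial` reads in the initial table but
    no row in the later table are added with a count of zero first, so
    they end up with exactly `pseudocount` reads. Counts are kept as
    integers, which is an assumption about what downstream code expects.
    """
    keep = initial.index[initial["count"] >= min_initial]
    padded = later.reindex(later.index.union(keep)).fillna(0)
    padded["count"] = padded["count"] + pseudocount
    return padded.astype({"count": "int64"})


def rewrite_counts(initial_path, later_paths, key=COUNTS_KEY, **kwargs):
    """Apply pad_with_pseudocount to each later file in place (untested sketch)."""
    initial = pd.read_hdf(initial_path, key)
    for path in later_paths:
        later = pd.read_hdf(path, key)
        padded = pad_with_pseudocount(initial, later, **kwargs)
        # Overwrites the existing table stored under `key`.
        padded.to_hdf(path, key=key, format="table", mode="a")
```

To keep all time points comparable, the same pseudocount can be added to the initial table as well by calling `pad_with_pseudocount(initial, initial)`.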

Please let me know if you need extra assistance getting this set up. There are some example notebooks in the documentation that show how to open the HDF5 files, but the code may be out of date.

Thanks for the reply Alan!
I see what you mean about it becoming problematic when doing regression. I have tested how well ratio-based and regression-based scores correlate for our data, and the correlation was quite good, so we might use ratio-based scores and add pseudocounts to the after-selection library using your suggestion.