cortes-ciriano-lab/SComatic

Understanding step 4.2 variant filters

bkinnersley opened this issue · 1 comments

Hello,

Thank you so much for such a useful tool. I have been applying this to single-cell multiome (RNA-Seq plus ATAC-Seq) libraries, using the parameters as recommended here for the different library types. I am trying to understand the reason variants can be filtered out (as detailed in the sixth "FILTER" column of the output file), and if you could direct me to some documentation on that it would be greatly appreciated. As far as I can see the different filter categories are as follows:

  1. "" (empty string)
  2. "Noisy_site"
  3. "PoN"
  4. "Multi-allelic"
  5. "LC_Upstream"
  6. "LC_Downstream"
  7. "Clustered"
  8. "Cell_type_noise"
  9. "Min_cell_types"

While many are straightforward to understand, others I am less sure about (particularly "Noisy_site", "LC_Upstream", "LC_Downstream", "Cell_type_noise"), so any help with this would be greatly appreciated. Thanks very much.

Best wishes

Ben

Dear Ben, thanks for your question. The filters are described in the legend of Supplementary Figure 8 in our paper - sorry for not making that info more accessible in the repo:

BetaBin: the candidate mutation was not supported by a sufficient number of reads with the alternate allele to pass the Beta-binomial test;

Cell_type_noise: the number of reads supporting the alternate allele is only significant (Beta-binomial test) when applied to all cells across all cell types considered, but not when applied to each cell type individually, or there are multiple alternate alleles, which suggests a noisy site;

Clustered: the candidate mutation was filtered because another candidate mutation maps within 5bp;
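As a rough illustration of the Clustered rule above, here is a minimal Python sketch that flags candidates with another candidate within 5 bp on the same chromosome. This is not SComatic's own code: the function name `flag_clustered`, the `(chrom, pos)` input format, and the choice to treat a distance of exactly 5 bp as "within 5bp" are all assumptions made for the example.

```python
def flag_clustered(sites, window=5):
    """Return the set of (chrom, pos) candidates that have a neighbour
    within `window` bp on the same chromosome.

    `sites` is an iterable of (chrom, pos) tuples (hypothetical input format).
    Treating distance == window as clustered is an assumption.
    """
    # Group positions by chromosome.
    by_chrom = {}
    for chrom, pos in sites:
        by_chrom.setdefault(chrom, []).append(pos)

    clustered = set()
    for chrom, positions in by_chrom.items():
        # After sorting, it is enough to compare adjacent positions:
        # any pair within the window implies an adjacent pair within it.
        positions = sorted(set(positions))
        for left, right in zip(positions, positions[1:]):
            if right - left <= window:
                clustered.add((chrom, left))
                clustered.add((chrom, right))
    return clustered


# Toy coordinates, made up for illustration.
sites = [("chr1", 100), ("chr1", 103), ("chr1", 200), ("chr2", 101)]
print(sorted(flag_clustered(sites)))
```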

LC_Upstream: the candidate mutation was filtered because it mapped upstream of a low-complexity region;

LC_Downstream: the candidate mutation was filtered because it mapped downstream of a low-complexity region;

Multiple_cell_types: the variant was found in different cell types of the same sample;

No_reads: no reads supporting the alternative allele were found;

Noisy_site: the candidate mutation was filtered because there are a significant number of reads supporting the alternate allele in a single cell type when running the Beta-binomial test for each cell type independently, but the site is also significant when applying the Beta-binomial test to all single cells across all cell types in a sample together;

PoN: variant filtered by the SComatic Panel of Normals (PoNs).
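To see how often each of the filters above is applied in a given output file, a small Python sketch along these lines may help. It rests on several assumptions that should be checked against your own files: the output is tab-separated, header lines start with `#`, FILTER is the sixth column (as described in the question), and a site carrying several filters lists them comma-separated. The function name `tally_filters` and the example rows are made up for illustration.

```python
from collections import Counter


def tally_filters(lines, filter_col=5, sep=","):
    """Count how often each FILTER label appears.

    `lines` is any iterable of text lines (e.g. an open file handle).
    `filter_col` is the 0-based index of the FILTER column (assumed: sixth).
    `sep` is the assumed separator between multiple labels on one site.
    """
    counts = Counter()
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= filter_col:
            continue  # malformed or truncated row
        # An empty FILTER field is counted under the label "".
        for label in fields[filter_col].split(sep):
            counts[label.strip()] += 1
    return counts


# Toy rows, not real SComatic output.
example = [
    "#CHROM\tStart\tEnd\tREF\tALT\tFILTER",
    "chr1\t100\t100\tA\tG\tPASS",
    "chr1\t250\t250\tC\tT\tNoisy_site,Clustered",
    "chr2\t500\t500\tG\tA\tPoN",
]
for label, n in tally_filters(example).most_common():
    print(repr(label), n)
```

On a real file, the same function can be called as `tally_filters(open(path))`, where `path` points to the step 4.2 output.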

Hope this helps? Thanks