PoonLab/covizu

Missing a lot of sequences

Closed this issue · 6 comments

Currently we are only displaying 3 samples categorized as BA.2.86:
image

However the GISAID database currently holds 776 records with this lineage label.

[covizu@Paphlagon data]$ unxz -c provision.2024-04-13T00\:00\:06.json.xz | grep -c "\"covv_lineage\": \"BA.2.86\""
947
=# select count(accession) from sequences where lineage='BA.2.86';
 count
-------
   916
(1 row)

It looks like we have 916 records in the database. Checking to see if these sequences are being filtered

Of the 918 BA.2.86 records that made it to the filter_problematic function:

def filter_problematic(records, origin='2019-12-01', rate=0.0655, cutoff=0.005,
maxtime=1e3, vcf_file='data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf',
encoded=False,
misstol=300, callback=None):
"""
Apply problematic sites annotation from de Maio et al.,
https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473
which are published and maintained as a VCF-formatted file.
:param records: generator, records from extract_features() or encode_diffs()
:param origin: str, date of root sequence in ISO format (yyyy-mm-dd)
:param rate: float, molecular clock rate (subs/genome/day), defaults
to 8e-4 * 29900 / 365
:param cutoff: float, use 1-cutoff to compute quantile of Poisson
distribution, defaults to 0.005
:param maxtime: int, maximum number of days to cache Poisson quantiles
:param vcf_file: str, path to VCF file
:param encoded: bool, flag to indicate if records are from encode_diffs()
:param misstol: int, maximum tolerated number of uncalled bases
:param callback: function, option to print messages to console
:yield: generator, revised records
"""
# load resources
mask = load_vcf(vcf_file)
qp = QPois(quantile=1-cutoff, rate=rate, maxtime=maxtime, origin=origin)
n_sites = 0
n_outlier = 0
n_ambig = 0
for record in records:
if type(record) is not dict:
qname, diffs, missing = record # unpack tuple
else:
diffs = record['diffs']
# exclude problematic sites
filtered = []
for typ, pos, alt in diffs:
if typ == '~' and int(pos) in mask and alt in mask[pos]['alt']:
continue
if typ != '-' and 'N' in alt:
# drop substitutions and insertions with uncalled bases
continue
filtered.append(tuple([typ, pos, alt]))
ndiffs = len(filtered)
n_sites += len(diffs) - ndiffs
if not encoded:
record['diffs'] = filtered
# exclude genomes with excessive divergence from reference
coldate = record['covv_collection_date']
if qp.is_outlier(coldate, ndiffs):
n_outlier += 1
continue
# exclude genomes with too much missing data
if total_missing(record) > misstol:
n_ambig += 1
continue
yield record
else:
if type(record) is dict:
qname, missing = record['qname'], record['missing']
yield [qname, filtered, missing]
if callback:
callback("filtered {} problematic features".format(n_sites))
callback(" {} genomes with excess missing sites".format(n_ambig))
callback(" {} genomes with excess divergence".format(n_outlier))

850 sequences were filtered out as being outliers, 65 were filtered out for having a lot of missing sites

Ok I think we have to turn off molecular clock filtering for now. Let's do the following:

  • pass cutoff=0 to filter_problematic
  • set qp = None if cutoff==0 around line:
qp = QPois(quantile=1-cutoff, rate=rate, maxtime=maxtime, origin=origin)
  • modify test if qp.is_outlier(coldate, ndiffs): to if qp and qp.is_outlier(coldate, ndiffs): to skip test if qp is None

Alternatively it might be easier to modify QPois to return False for every call to is_outlier when it is initialized with cutoff=0

Reprocessing everything in the database to update variants with this change