Missing a lot of sequences

Currently we are only displaying 3 samples categorized as BA.2.86:

However the GISAID database currently holds 776 records with this lineage label.

[covizu@Paphlagon data]$ unxz -c provision.2024-04-13T00\:00\:06.json.xz | grep -c "\"covv_lineage\": \"BA.2.86\""
947

=# select count(accession) from sequences where lineage='BA.2.86';
 count
-------
   916
(1 row)

It looks like we have 916 records in the database. Checking to see if these sequences are being filtered

Of the 918 BA.2.86 records that made it to the filter_problematic function:

covizu/covizu/utils/gisaid_utils.py

Lines 196 to 267 in ca3379d

    
           def filter_problematic(records, origin='2019-12-01', rate=0.0655, cutoff=0.005, 
        
                                  maxtime=1e3, vcf_file='data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf', 
        
                                  encoded=False, 
        
                                  misstol=300, callback=None): 
        
               """ 
        
               Apply problematic sites annotation from de Maio et al., 
        
               https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 
        
               which are published and maintained as a VCF-formatted file. 
        
               :param records:  generator, records from extract_features() or encode_diffs() 
        
               :param origin:  str, date of root sequence in ISO format (yyyy-mm-dd) 
        
               :param rate:  float, molecular clock rate (subs/genome/day), defaults 
        
                             to 8e-4 * 29900 / 365 
        
               :param cutoff:  float, use 1-cutoff to compute quantile of Poisson 
        
                               distribution, defaults to 0.005 
        
               :param maxtime:  int, maximum number of days to cache Poisson quantiles 
        
               :param vcf_file:  str, path to VCF file 
        
               :param encoded: bool, flag to indicate if records are from encode_diffs()  
        
               :param misstol:  int, maximum tolerated number of uncalled bases 
        
               :param callback:  function, option to print messages to console 
        
               :yield:  generator, revised records 
        
               """ 
        
               # load resources 
        
               mask = load_vcf(vcf_file) 
        
               qp = QPois(quantile=1-cutoff, rate=rate, maxtime=maxtime, origin=origin) 
        
               n_sites = 0 
        
               n_outlier = 0 
        
               n_ambig = 0 
        
               for record in records: 
        
                   if type(record) is not dict: 
        
                       qname, diffs, missing = record  # unpack tuple 
        
                   else: 
        
                       diffs = record['diffs'] 
        
                   # exclude problematic sites 
        
                   filtered = [] 
        
                   for typ, pos, alt in diffs: 
        
                       if typ == '~' and int(pos) in mask and alt in mask[pos]['alt']: 
        
                           continue 
        
                       if typ != '-' and 'N' in alt: 
        
                           # drop substitutions and insertions with uncalled bases 
        
                           continue 
        
                       filtered.append(tuple([typ, pos, alt])) 
        
                   ndiffs = len(filtered) 
        
                   n_sites += len(diffs) - ndiffs 
        
                   if not encoded: 
        
                       record['diffs'] = filtered 
        
                       # exclude genomes with excessive divergence from reference 
        
                       coldate = record['covv_collection_date'] 
        
                       if qp.is_outlier(coldate, ndiffs): 
        
                           n_outlier += 1 
        
                           continue 
        
                       # exclude genomes with too much missing data 
        
                       if total_missing(record) > misstol: 
        
                           n_ambig += 1 
        
                           continue 
        
                       yield record 
        
                   else: 
        
                       if type(record) is dict: 
        
                           qname, missing = record['qname'], record['missing'] 
        
                       yield [qname, filtered, missing] 
        
               if callback: 
        
                   callback("filtered {} problematic features".format(n_sites)) 
        
                   callback("         {} genomes with excess missing sites".format(n_ambig)) 
        
                   callback("         {} genomes with excess divergence".format(n_outlier))

850 sequences were filtered out as being outliers, 65 were filtered out for having a lot of missing sites

Ok I think we have to turn off molecular clock filtering for now. Let's do the following:

pass cutoff=0 to filter_problematic
set qp = None if cutoff==0 around line:

qp = QPois(quantile=1-cutoff, rate=rate, maxtime=maxtime, origin=origin)

modify test if qp.is_outlier(coldate, ndiffs): to if qp and qp.is_outlier(coldate, ndiffs): to skip test if qp is None

Alternatively it might be easier to modify QPois to return False for every call to is_outlier when it is initialized with cutoff=0

Reprocessing everything in the database to update variants with this change

	def filter_problematic(records, origin='2019-12-01', rate=0.0655, cutoff=0.005,
	maxtime=1e3, vcf_file='data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf',
	encoded=False,
	misstol=300, callback=None):
	"""
	Apply problematic sites annotation from de Maio et al.,
	https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473
	which are published and maintained as a VCF-formatted file.

	:param records: generator, records from extract_features() or encode_diffs()
	:param origin: str, date of root sequence in ISO format (yyyy-mm-dd)
	:param rate: float, molecular clock rate (subs/genome/day), defaults
	to 8e-4 * 29900 / 365
	:param cutoff: float, use 1-cutoff to compute quantile of Poisson
	distribution, defaults to 0.005
	:param maxtime: int, maximum number of days to cache Poisson quantiles
	:param vcf_file: str, path to VCF file
	:param encoded: bool, flag to indicate if records are from encode_diffs()
	:param misstol: int, maximum tolerated number of uncalled bases
	:param callback: function, option to print messages to console
	:yield: generator, revised records
	"""
	# load resources
	mask = load_vcf(vcf_file)
	qp = QPois(quantile=1-cutoff, rate=rate, maxtime=maxtime, origin=origin)

	n_sites = 0
	n_outlier = 0
	n_ambig = 0
	for record in records:
	if type(record) is not dict:
	qname, diffs, missing = record # unpack tuple
	else:
	diffs = record['diffs']

	# exclude problematic sites
	filtered = []
	for typ, pos, alt in diffs:
	if typ == '~' and int(pos) in mask and alt in mask[pos]['alt']:
	continue
	if typ != '-' and 'N' in alt:
	# drop substitutions and insertions with uncalled bases
	continue
	filtered.append(tuple([typ, pos, alt]))

	ndiffs = len(filtered)
	n_sites += len(diffs) - ndiffs

	if not encoded:
	record['diffs'] = filtered

	# exclude genomes with excessive divergence from reference
	coldate = record['covv_collection_date']
	if qp.is_outlier(coldate, ndiffs):
	n_outlier += 1
	continue

	# exclude genomes with too much missing data
	if total_missing(record) > misstol:
	n_ambig += 1
	continue
	yield record
	else:
	if type(record) is dict:
	qname, missing = record['qname'], record['missing']

	yield [qname, filtered, missing]

	if callback:
	callback("filtered {} problematic features".format(n_sites))
	callback(" {} genomes with excess missing sites".format(n_ambig))
	callback(" {} genomes with excess divergence".format(n_outlier))