Missing a lot of sequences
Closed this issue · 6 comments
ArtPoon commented
[covizu@Paphlagon data]$ unxz -c provision.2024-04-13T00\:00\:06.json.xz | grep -c "\"covv_lineage\": \"BA.2.86\""
947
GopiGugan commented
=# select count(accession) from sequences where lineage='BA.2.86';
count
-------
916
(1 row)
It looks like we have 916 records in the database. Checking to see whether these sequences are being filtered.
GopiGugan commented
Of the 918 BA.2.86 records that made it to the filter_problematic function (covizu/covizu/utils/gisaid_utils.py, lines 196 to 267 in ca3379d), 850 sequences were filtered out as being outliers and 65 were filtered out for having a lot of missing sites.
ArtPoon commented
Ok I think we have to turn off molecular clock filtering for now. Let's do the following:
- pass cutoff=0 to filter_problematic
- set qp = None if cutoff==0, around the line:
  qp = QPois(quantile=1-cutoff, rate=rate, maxtime=maxtime, origin=origin)
- modify the test
  if qp.is_outlier(coldate, ndiffs):
  to
  if qp and qp.is_outlier(coldate, ndiffs):
  so that the test is skipped if qp is None
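The three steps above could look roughly like the sketch below. It is not the actual filter_problematic code: StubQPois, filter_records, the max_ndiffs threshold and the default cutoff value are hypothetical stand-ins, used only to show how qp = None combined with the "if qp and ..." guard disables the clock test.

```python
class StubQPois:
    """Hypothetical stand-in for QPois. It only mimics the is_outlier
    interface with a fixed threshold, not the real molecular clock test."""

    def __init__(self, max_ndiffs):
        self.max_ndiffs = max_ndiffs

    def is_outlier(self, coldate, ndiffs):
        return ndiffs > self.max_ndiffs


def filter_records(records, cutoff=0.005, max_ndiffs=50):
    """Yield (coldate, ndiffs) records that pass the clock test.

    cutoff=0 switches the test off entirely (the default shown here is
    illustrative only).
    """
    # Step 2: no clock model at all when filtering is disabled
    qp = None if cutoff == 0 else StubQPois(max_ndiffs)
    for coldate, ndiffs in records:
        # Step 3: the "qp and" guard skips the test when qp is None
        if qp and qp.is_outlier(coldate, ndiffs):
            continue
        yield coldate, ndiffs


records = [("2024-01-01", 10), ("2024-02-01", 120)]
kept_default = list(filter_records(records))          # clock test active
kept_all = list(filter_records(records, cutoff=0))    # clock test disabled
```

With the stub threshold, the second record is dropped when the test is active and kept when cutoff=0 is passed.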
ArtPoon commented
Alternatively it might be easier to modify QPois to return False for every call to is_outlier when it is initialized with cutoff=0
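A minimal sketch of that alternative, assuming the constructor signature from the call quoted earlier (quantile, rate, maxtime, origin). Since the caller passes quantile=1-cutoff, cutoff=0 arrives as quantile=1. SketchQPois and its _exceeds_clock placeholder are hypothetical and do not reproduce the real QPois internals.

```python
from datetime import date


class SketchQPois:
    """Hypothetical simplified QPois showing only the short-circuit."""

    def __init__(self, quantile, rate, maxtime, origin):
        self.quantile = quantile
        self.rate = rate          # assumed here: expected differences per day
        self.maxtime = maxtime
        self.origin = origin      # assumed here: a datetime.date
        # quantile = 1 - cutoff, so cutoff == 0 means quantile == 1
        self.disabled = quantile >= 1

    def is_outlier(self, coldate, ndiffs):
        if self.disabled:
            return False  # filtering turned off: nothing is an outlier
        return self._exceeds_clock(coldate, ndiffs)

    def _exceeds_clock(self, coldate, ndiffs):
        # Placeholder rule, NOT the real Poisson quantile test
        expected = self.rate * (coldate - self.origin).days
        return ndiffs > 2 * expected + 5


origin = date(2019, 12, 1)
qp_off = SketchQPois(quantile=1 - 0, rate=0.1, maxtime=1000, origin=origin)
qp_on = SketchQPois(quantile=1 - 0.005, rate=0.1, maxtime=1000, origin=origin)
```

The appeal of this variant is that callers such as filter_problematic would not need the extra "qp and" guard, because the object is always present and simply never flags anything.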
ArtPoon commented
Reprocessing everything in the database to update variants with this change