PoonLab/covizu

Error retrieving a record from the database

Closed this issue · 2 comments

The pipeline is failing because it tries to retrieve a record that does not exist in the database.

For example, sequences for lineage XDT were inserted into the database, but the corresponding cluster information was not.

The original hypothesis was that the lineage assignment had changed for a sequence in a recent provisions file. However, that does not seem to be the case.

@GopiGugan Currently rebuilding the database; will investigate whether cluster information reproducibly fails to be inserted for XDT.

> For example, sequences for lineage XDT were inserted into the database, but the corresponding cluster information was not.

I believe there is no cluster record for XDT because the XDT records were previously filtered out by the filter_problematic function.

def filter_problematic(records, origin='2019-12-01', rate=0.0655, cutoff=0.005,
                       maxtime=1e3, vcf_file='data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf',
                       encoded=False,
                       misstol=300, callback=None):
    ...  # docstring and body are not shown in this excerpt

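To illustrate how a lineage could end up with stored sequences but no cluster record, here is a minimal sketch. It does not reproduce filter_problematic; `drop_incomplete`, `group_by_lineage`, the record fields, and the sample data are all hypothetical. It assumes the filter discards whole records (here, on a missing-site threshold named after the `misstol` parameter) and that sequences are stored before filtering.

```python
from collections import defaultdict


def drop_incomplete(records, misstol=300):
    """Hypothetical stand-in filter (NOT the real filter_problematic):
    skip any record with more than `misstol` missing sites."""
    for rec in records:
        if rec["missing"] > misstol:
            continue
        yield rec


def group_by_lineage(records):
    """Group record names by lineage. A lineage whose records were all
    filtered out never becomes a key in the result."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["lineage"]].append(rec["name"])
    return dict(grouped)


# Made-up records for illustration only.
records = [
    {"name": "seq1", "lineage": "B.1", "missing": 10},
    {"name": "seq2", "lineage": "XDT", "missing": 900},
    {"name": "seq3", "lineage": "XDT", "missing": 450},
]

# Assumption: sequences are stored before filtering, so XDT is "in the database".
stored_lineages = {rec["lineage"] for rec in records}

# Clustering only sees what survives the filter.
by_lineage = group_by_lineage(drop_incomplete(records))

# Lineages with stored sequences but nothing to build a cluster from.
missing_clusters = stored_lineages - set(by_lineage)
print(missing_clusters)
```

If the pipeline behaves like this sketch, any later step that looks up a cluster record for every stored lineage would fail on XDT.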
Lineages that appear in by_lineage but were not previously inserted into the clusters table should be processed again:

diff --git a/batch.py b/batch.py
index 878ac70..91641d8 100644
--- a/batch.py
+++ b/batch.py
@@ -435,7 +435,17 @@ if __name__ == "__main__":
         SELECT DISTINCT LINEAGE FROM NEW_RECORDS;
         '''
         CUR.execute(UPDATED_LINEAGES_QUERY)
-        UPDATED_LINEAGES = [row['lineage'] for row in CUR.fetchall()]
+        new_records_lineages = [row['lineage'] for row in CUR.fetchall()]
+
+        by_lineage_list = list(by_lineage.keys())
+        clusters_lineages_query = '''
+        SELECT DISTINCT LINEAGE FROM CLUSTERS;
+        '''
+        CUR.execute(clusters_lineages_query)
+        clusters_lineages = [row['lineage'] for row in CUR.fetchall()]
+        unique_by_lineage = list(set(by_lineage_list) - set(clusters_lineages))
+
+        UPDATED_LINEAGES = list(set(new_records_lineages).union(set(unique_by_lineage)))
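
The set logic in the patch above can be sketched as a standalone function, which makes it easy to check without a database connection. The function name `lineages_to_update` and the sample lineages are hypothetical; the inputs stand in for the results of the two queries and for the keys of by_lineage.

```python
def lineages_to_update(new_records_lineages, by_lineage, clusters_lineages):
    """Return the lineages to (re)process: those with new records, plus
    those present in by_lineage that have no row in the clusters table."""
    # Lineages that were grouped but never made it into the clusters table.
    missing_from_clusters = set(by_lineage) - set(clusters_lineages)
    # Sorted only to make the result deterministic; the patch itself
    # builds an unordered list.
    return sorted(set(new_records_lineages) | missing_from_clusters)


# Made-up example: XDT is in by_lineage but has no cluster row.
updated = lineages_to_update(
    new_records_lineages=["B.1.1.7"],
    by_lineage={"B.1.1.7": [], "XDT": [], "BA.2": []},
    clusters_lineages=["B.1.1.7", "BA.2"],
)
print(updated)
```

In this example XDT is picked up for reprocessing even though it has no new records, while BA.2 is skipped because it already has a cluster row.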