EHDEN/ETL-UK-Biobank

Primary key visit occurrence failing (2021-03-16)

Closed this issue · 2 comments

2021-03-16 23:56:10,642 - INFO - Unable to add pk_visit_occurrence

From DQD, isPrimaryKey, 76,108 unique visit_occurrence_ids are duplicated (1.51% of all visit records).

To understand where the duplicates are coming from, we can execute the following query:

SELECT cdmTable.data_source,
       count(*),
       sum(CASE WHEN cdmTable.VISIT_OCCURRENCE_ID IN (
           SELECT VISIT_OCCURRENCE_ID
           FROM omopcdm.VISIT_OCCURRENCE
           GROUP BY VISIT_OCCURRENCE_ID
           HAVING COUNT(*) > 1
       ) THEN 1 ELSE 0 END) AS n_pk_duplicates
FROM omopcdm.VISIT_OCCURRENCE cdmTable
GROUP BY cdmTable.data_source;

Examination of real data shows that PR #274 indeed solves this issue.

All pk clashes are in gp_covid.

data_source count n_pk_duplicates
covid19 gp_emis 2452889 38054
covid19 gp_tpp 1299931 38054