Separate Quality Flag columns for each quality flag or each type of quality flag
heather-i opened this issue · 14 comments
In order to further automate our data management and reporting systems to be compatible with the Ontario Data Template, which is based on the Open Data Model, it would be helpful to split out the Quality Flag column in the WWMeasure tab to more easily support cases where there is more than one quality flag for a data point. According to the Protocol Evaluations for qPCR Performance used by the MECP and OCWA (attached), there are the following qualifiers:
B - background contamination observed (greater than 5 Cq away from samples; indicating that it would not affect quantification but it is present)
FI - failed inhibition
AI - addressed inhibition
ND - non-detect
J - concentration estimate extrapolated based on extending experiment-specific standard curve to the y-intercept
UJ - "Trace" amplification of target; concentration estimate extrapolated based on extending experiment-specific standard curve beyond the y-intercept
It would be easiest if all of these had their own column that could be true/false, or if they were at least grouped by type of quality flag. For example, one column for flags that correspond to concentration (ND, J, and UJ), since each sample can only be one of these, and one column for flags that correspond to inhibition (FI and AI), since each sample can only have one of these.
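A minimal sketch of the grouped-column idea, in case it helps the discussion. The flag codes come from the qualifier list above; the `;` separator, the function name, and the output column names (`concentrationFlag`, `inhibitionFlag`, `backgroundContamination`) are assumptions made for illustration and are not part of the Ontario Data Template or the ODM.

```python
# Flag codes are taken from the qualifier list above. The grouping follows the
# proposal: at most one concentration flag and one inhibition flag per sample.
CONCENTRATION_FLAGS = {"ND", "J", "UJ"}
INHIBITION_FLAGS = {"FI", "AI"}
CONTAMINATION_FLAGS = {"B"}


def split_quality_flags(raw, sep=";"):
    """Split one combined flag string into one value per flag group.

    The separator and the returned column names are hypothetical.
    """
    flags = {f.strip() for f in raw.split(sep) if f.strip()}
    unknown = flags - CONCENTRATION_FLAGS - INHIBITION_FLAGS - CONTAMINATION_FLAGS
    if unknown:
        raise ValueError(f"unknown quality flags: {sorted(unknown)}")

    def pick_one(group, name):
        # Enforce the "each sample can only have one of these" rule.
        found = sorted(flags & group)
        if len(found) > 1:
            raise ValueError(f"more than one {name} flag: {found}")
        return found[0] if found else None

    return {
        "concentrationFlag": pick_one(CONCENTRATION_FLAGS, "concentration"),
        "inhibitionFlag": pick_one(INHIBITION_FLAGS, "inhibition"),
        "backgroundContamination": "B" in flags,
    }


example = split_quality_flags("J; AI; B")
```

Here `example` holds one value per group, so a report could filter on the concentration flag without parsing a combined string.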
I know this is very Ontario-specific so I understand if it is not possible to make these changes but if it is possible to include them and have them be optional for users of the Open Data Model who do not need these separated, then that would be wonderful.
Thanks!
There was a conversation on this topic in today's ODM Implementation meeting.
Multiple people present agreed that multiple quality flags could be relevant for a single measure (e.g., control curve quality concerns + inhibition quality concerns). Though qPCR was at the forefront of this discussion, it is easy to imagine that this could be the case for other types of measures as well (sequencing, sampling, etc.)
There are multiple ways to deal with a measure having many quality concerns. Some of them would require modifying the structure of the ODM slightly.
(A) Option that doesn't require changes to the ODM structure.
- Only recording the "most important" quality flag for a given measure.
Advantages: Straightforward to implement (only documentation is needed)
Drawbacks: It is unclear and possibly impossible for users to determine whether a given flag is more or less important than another.
(B) Option that requires minimal changes
- Having multiple `qualityFlag` fields (`qualityFlag1`, `qualityFlag2`, ...) in the `measures` and `samples` tables.
Advantages: This would be very straightforward to implement, and it wouldn't require the creation of new tables.
Drawbacks: 1) It widens the report tables by several fields. 2) It might seem like they should be interpreted as having a hierarchy (is `qualityFlag1` more important than `qualityFlag3` because it is ranked first?). 3) It opens the door to adding lists in many other places in the ODM, potentially mucking up the overall structure.
Options that require adding new tables
The linkages between quality flags and measures / samples could be done in (at least) 2 ways:
Option C: "Loose" linkage
This option:
- Removes the `qualityFlag` field from the `measures` and `samples` tables
- Creates a table with the following fields:
  - unique id (primary key)
  - measureID
  - sampleID
  - qualityFlag
Users would thus create as many rows as they need (each linked with the specific measure or sample they want to qualify). They would need to add the quality information into this new table themselves.
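For illustration, the "loose" linkage could look roughly like the following, using Python's built-in `sqlite3` module. The table name `qualityFlags`, the `qualityID` key name, and the sample rows are assumptions; only the four fields listed above come from the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores REFERENCES without this
conn.executescript("""
CREATE TABLE measures (measureID TEXT PRIMARY KEY, value REAL);
CREATE TABLE samples (sampleID TEXT PRIMARY KEY);
-- Hypothetical name for the new table; one row per flag.
CREATE TABLE qualityFlags (
    qualityID   INTEGER PRIMARY KEY,
    measureID   TEXT REFERENCES measures(measureID),
    sampleID    TEXT REFERENCES samples(sampleID),
    qualityFlag TEXT NOT NULL
);
""")

# Made-up rows: one measure with two flags, one sample with one flag.
conn.execute("INSERT INTO measures VALUES ('m1', 12.5)")
conn.execute("INSERT INTO samples VALUES ('s1')")
conn.executemany(
    "INSERT INTO qualityFlags (measureID, sampleID, qualityFlag) VALUES (?, ?, ?)",
    [("m1", None, "J"), ("m1", None, "AI"), (None, "s1", "B")],
)

# All flags attached to measure m1.
flags_for_m1 = [
    row[0]
    for row in conn.execute(
        "SELECT qualityFlag FROM qualityFlags WHERE measureID = ? ORDER BY qualityFlag",
        ("m1",),
    )
]
```

The `measures` and `samples` tables stay narrow, and a measure with no quality concerns simply has no rows in the new table.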
Option D: "Hard" linkage
This option takes advantage of the fact that quality flags are pre-determined by the dictionary. Therefore, all the possible combinations of quality flags that could be reported can be inferred from the contents of the dictionary itself. Say, in measure `x`'s qualitySet, that there are three possible flags: `A`, `B`, and `C`. We thus immediately know that the quality concerns for a measurement of `x` can only be one of `{[], [A], [B], [C], [A, B], [A, C], [B, C], [A, B, C]}`. A new table (say, `qFCombinations`) could be automatically generated from the contents of each `qualitySet`, with each combination having its unique id.
Then, the `measure` and `sample` tables only need to replace their `qualityFlag` field with a `qfCombinationID` field to link the measure / sample to the correct combination of flags.
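The enumeration described for Option D can be sketched with `itertools.combinations`. The function name and the integer ids are assumptions for illustration; the proposal does not say how combination ids would be assigned.

```python
from itertools import combinations


def flag_combinations(quality_set):
    """Enumerate every subset of a quality set, keyed by a generated id.

    Ids are assigned in order of subset size, then alphabetically. This
    numbering is an assumption, not something the proposal specifies.
    """
    flags = sorted(quality_set)
    combos = []
    for size in range(len(flags) + 1):
        combos.extend(combinations(flags, size))
    return {combo_id: list(combo) for combo_id, combo in enumerate(combos)}


# The three-flag example from the text: 2**3 = 8 combinations.
qf_combinations = flag_combinations({"A", "B", "C"})

# A set with ten flags already needs 2**10 = 1024 rows.
ten_flag_table = flag_combinations({f"F{i}" for i in range(10)})
```

Each added flag doubles the size of the generated table, which is the main cost of this option.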
Advantages: It maintains an explicit link between the measures and samples tables and the quality measures, and it allows users to keep filling all their values only in the samples and measures tables.
Disadvantages: The number of combinations grows geometrically with each new flag in a set (it doubles each time), which could become unwieldy over time, and it adds another step to the dictionary generation (i.e., every time a quality flag is added to a qualityflagSet, new combinations must also be added to the qfCombination table).
These options aren't exhaustive, but hopefully they get the conversation rolling on the best way forward :)
Another aspect of quality that was mentioned in the meeting was how to report LOD / LOQ for measures.
The dictionary is flexible in this regard, so it would probably be good to agree on a common way of doing things.
Here are all the options I can think of:
- Add `loq` and `lod` values as rows in the `measures` table.
- Add `loq` and `lod` as methodSteps in the MethodSet table. Then, link the measures that use that assay with the correct methodSetID.
- Add `loq` and `lod` values as rows in the `measures` table AND link them to the relevant lab measurement with a measureSetID. Thus, by looking through all the measures of a given measureSetID, we would find for each qPCR measurement:
  - the qPCR measurement
  - its associated lod
  - its associated loq
- Add a field to the `measures` table for `lod` and another for `loq`.
- If Option C is selected for the quality flags, `lod` and `loq` could be added as fields there.
The issue I see with lod and loq being in the quality table is that then it's hard to link the value to the right unit.
If lod and loq are properties in the measure, then the unit can be assumed to be the same as for the reported value.
If lod and loq are their own rows, either in MethodSteps or measures, they can have their own unit without having to worry about the unit used by specific measurements.
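To make the "own rows linked by a measureSetID" variant concrete, here is a small sketch. The field names (`measure`, `value`, `unit`), the measure code `covN1`, and all the numbers are made up for illustration; only `measureSetID`, `lod`, and `loq` come from the discussion above.

```python
# Hypothetical rows of a measures table: one qPCR result plus its lod and loq,
# all sharing the same measureSetID. Each row carries its own unit.
measures = [
    {"measureID": "m1", "measureSetID": "set1", "measure": "covN1", "value": 1200.0, "unit": "gc/L"},
    {"measureID": "m2", "measureSetID": "set1", "measure": "lod", "value": 500.0, "unit": "gc/L"},
    {"measureID": "m3", "measureSetID": "set1", "measure": "loq", "value": 1500.0, "unit": "gc/L"},
]


def limits_by_measure_set(rows):
    """Map each measureSetID to the lod/loq rows stored alongside its measures."""
    limits = {}
    for row in rows:
        if row["measure"] in ("lod", "loq"):
            limits.setdefault(row["measureSetID"], {})[row["measure"]] = row
    return limits


def below_limit(row, limits, which):
    """Return True/False, or None when no limit in the same unit is recorded."""
    limit = limits.get(row["measureSetID"], {}).get(which)
    if limit is None or limit["unit"] != row["unit"]:
        return None
    return row["value"] < limit["value"]


limits = limits_by_measure_set(measures)
below_lod = below_limit(measures[0], limits, "lod")
below_loq = below_limit(measures[0], limits, "loq")
```

Because the limit rows carry their own unit, the comparison can refuse to answer when the units differ instead of silently comparing mismatched values.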
Thank you for summarizing the discussion from the ODM implementation meeting and clearly describing the advantages and drawbacks of each option!
I will break up my thoughts into Ontario Data Template/MECP-specific notes and general ODM notes:
Ontario Data Template/MECP-specific notes
Option B would be my preference.
Pros:
- allows for the immediate inclusion of multiple quality flags associated with RT-qPCR data
- no new table to try to format and maintain
- easiest to fit into our lab's (University of Waterloo) current data analysis and reporting workflow
Cons:
- leaving it open-ended as to which quality flag will go into column1, column2 would be problematic for filtering data and using it further. This could be remedied by allowing users (MECP, managers of the Ontario Data Template) to define which quality flag corresponds to which column (ex. qualityflag1 = quantity flags: ND, <LOD, <LOQ; qualityflag2 = inhibition flags: FI, AI; qualityflag3 = contamination flags: B)
General ODM notes
If I understand correctly, the ODM is set up to be formatted for a number of different users, so the WWMeasure table can be used for data produced by labs measuring any biologic, toxin, or other health risk, using any number of techniques or assays. Therefore, it needs to be both very flexible, to accommodate all possible uses/users, and customizable, so that it is able to capture highly precise data for each use/user. This may be a very naïve understanding of the ODM so please take the following comments lightly.
I propose to give users (MECP in my case) the ability to select a customized WWMeasure table based on the assayMethod. For example, if the assay method is RT-qPCR, the WWMeasure table will have quality flag columns for this technique but if the assay method is sequencing, the WWMeasure table would be altered to accommodate that data type.
Advantages: ensures data from any assay is being recorded with all caveats/flags so that only the highest quality data is being used for interpretation; increases user friendliness when reporting as the columns are understood by the labs/persons producing the data from each type of assay.
Drawbacks: I anticipate that this would be a lot of work to coordinate, and I do not want to take that lightly.
LOD/LOQ thoughts
I believe this should be part of the Quality flag column (as it currently is in the ODT) as these values can change over time as improvements/changes to assays occur, so it is easier to note if each of the values reported in the WWMeasure table are below the LOD or LOQ at the time of reporting.
Note: I am also in the process of communicating this to Vince Pileggi and Sherif Hegazy (MECP; points of contact for the Ontario Data Template) so again please ignore if these changes/thoughts are not relevant to the ODM itself.
Just a clarification that this issue discussion is referencing versions 1.1 and 2.0.
@heather-i references are mostly about v1.1. @jeandavidt references are mostly about version 2.0.
Version 2.0 expands the dictionary and the model quite a bit. The name change from `WWMeasure` (wastewater measure) to `measures` reflects that measures can be for water, air or surface and more robustly include population measures (testing, hospitalization, etc.).
In version 2.0:
- LOD and LOQ change from headers within the `measures` tables to what is described by @jeandavidt in this issue (a row in the measure table linked to measures using `measureSetID` or within the `methods` table). There are a few reasons for this change. Most notably, `LOD` and `LOQ` are relevant to specific measures, such as PCR measures. There is a considerable increase in measures, such as chemical and physical properties, where LOD and LOQ don't apply.
- Quality sets (`qualitySet`) are introduced. Currently, four quality sets are described, but more can be added at any time: Generic Quality Flag Set; PCR Quality Set; Sample Quality Set; Sequencing Quality Set. Each measure can have a quality set.
  As an aside, in Version 2.0, there are also aggregation sets (`aggSet`), and unit sets (`unitSet`). So, each measure has an aggregation set, unit set and quality set. A unit set for temperature (degrees Celsius) is different from a unit set for SARS-CoV-2 N1 gene region detection by PCR (gene copies per L, gene copies per copies of PPMoV, etc.)
- Measure sets (`measureSetID`) are introduced. Measure sets allow groups of measures to be associated with each other. There are several use cases for measure sets, but they are generic and flexible. Associating LOQ and LOD to a group of measures was one identified use case. Other use cases include:
  - variants - when performing variant testing, there may be multiple variants identified, etc.
  - controls - generating and reporting Ct curves or performing dilutions, spike samples, etc.
Measures and samples can also be grouped, but there are slightly different considerations. Samples have the provision for having parent, child, combined samples, etc. Methods have methodSteps that can be grouped, and then groups can be combined. For example, there could be several RNA extraction steps that can be grouped together and then added with other groups of steps for, say, concentration, PCR, etc. to form an overall assay method.
Remember that we’ll want our quality measures to work in both ‘long’ and ‘wide’ data formats. I don’t foresee major issues with any proposed solutions, but there are a few considerations and implementation issues. We’ll likely want the core ODM development team to review how to generate long names before we sign off on a reporting approach.
Long data is the main ODM data format, but version 2 provides better support for wide tables with an explicit formula for generating wide names. Variable names for wide tables can get very long because the names are a concatenation of attributes. See below. This means that we’ll want short part names for quality measures.
The figure below is preliminary and not quite up-to-date. Regardless the figure informs the general approach.
For Option B, what is the implementation? Do we need key:value pairs? Maybe even key:value:unit (for numerical quality measures)? @mathew-thomson @heather-i
1. Key:value pairs: `qf1_partID`, `qf1_value`, `qf2_partID`, `qf2_value`. Example: `qf1_partID` = J, `qf1_value` = TRUE.
2. Use the partID as the name, and then the entry is the value: `qf1_J`. Example: `qf1_J` = True.
3. Have the quality measure as the value and assume TRUE: `qf1`. Example: `qf1` = J.
The above approaches also need to work for quality measures that are not Boolean but real numbers: measures such as LOQ, concentration estimate, etc. Key:value pairs could work for these measures, and so could implementation 2. The value measures need an accompanying unit and maybe also an aggregation.
A challenge for implementation 2 is a proliferation of qf1_ variables as headers. Remember that there are measures other than PCR. Currently, in version 2, there are 22 quality measures, which would mean adding 22 variable headers to the measures table, most of which are not relevant for any one measure. Now you've got a wide-table format instead of a long-table format.
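As a rough illustration of how the key:value-pair layout (`qf1_partID`, `qf1_value`, ...) could be folded back into long format, here is a sketch. The function name, the `max_flags` cap, and the output field names are assumptions; the ODM's actual formula for wide names may differ.

```python
def wide_quality_to_long(row, id_field="measureID", max_flags=10):
    """Turn qfN_partID / qfN_value column pairs into one long row per flag.

    Empty or missing qfN_partID columns are skipped, so unused slots in the
    wide table do not produce rows.
    """
    long_rows = []
    for i in range(1, max_flags + 1):
        part_id = row.get(f"qf{i}_partID")
        if part_id in (None, ""):
            continue
        long_rows.append(
            {
                id_field: row[id_field],
                "qualityFlag": part_id,
                "value": row.get(f"qf{i}_value"),
            }
        )
    return long_rows


# Made-up wide row with two filled flag slots and one empty slot.
wide_row = {
    "measureID": "m1",
    "qf1_partID": "J",
    "qf1_value": True,
    "qf2_partID": "AI",
    "qf2_value": True,
    "qf3_partID": "",
}
long_rows = wide_quality_to_long(wide_row)
```

The same shape would carry numeric quality measures, since `value` is not restricted to booleans, though a unit column would still be needed for those.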
So for version 2.0, we propose going for option C:
- A new quality table is created
- Each row of the quality table creates a new quality flag
- Quality flags can be linked to either of the following (1 per row): measure report, measure set report, sample report
- Any sample, measure set or measure can have as many flags as required
Addressing this problem raised the issue of measure sets vs sets of quality flags. The question was: since we are now linking several quality flags to a measure, isn't this the same as creating a measure set, but for quality?
I looked into this, and they turn out to be different. The difference is in the number of links the different entities can have together:
- A measure, measure set or sample can be linked to many quality flags
- A measure set can be linked to many measures
- One quality flag can only be linked to a single measure or measure set at the same time.
- At the moment, a measure can be linked to a single measure set at the same time.
So:
- the linkage between measures and quality flags is 1:n (one measure, many flags) and
- the linkage between measures and measure sets is n:1 (many measures, one set)
i.e., the directionality of the one-to-many relationship is inverted.
So we can't use the quality table to replace the measure set table.
But a thing we might want to do is to allow one measure to belong to many sets (say, a set of replicates and a set of all the measures that were done with the same calibration curve). For that, we need to turn the relationship between measures and measureSets from n:1 to n:n.
n:n linkages require a lookup table. Thus, the setup would be the following:
- The measureReport Table loses its measureSetReportID field
- The measureSetReport table is still there to store the unique ID of each new set. We can also choose to give names to the sets and / or a type (e.g., qualitycontrolSet or ReplicateSet or whatever)
- A new lookup table (maybe, MeasureSetHasMeasures) that has no primary key field and 2 foreign keys as fields (MeasureSetID, measureReportID)
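The n:n setup listed above could be sketched as follows with Python's built-in `sqlite3`. The `setType` column name, the `UNIQUE` constraint on the lookup table, and the example ids are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores REFERENCES without this
conn.executescript("""
-- measureReport no longer carries a measureSetReportID field.
CREATE TABLE measureReport (measureReportID TEXT PRIMARY KEY);
-- setType is a hypothetical column for the optional set type.
CREATE TABLE measureSetReport (measureSetID TEXT PRIMARY KEY, setType TEXT);
-- Lookup table: no primary key field of its own, just the two foreign keys.
CREATE TABLE MeasureSetHasMeasures (
    measureSetID    TEXT NOT NULL REFERENCES measureSetReport(measureSetID),
    measureReportID TEXT NOT NULL REFERENCES measureReport(measureReportID),
    UNIQUE (measureSetID, measureReportID)  -- assumption: no duplicate links
);
""")

conn.executemany("INSERT INTO measureReport VALUES (?)", [("m1",), ("m2",)])
conn.executemany(
    "INSERT INTO measureSetReport VALUES (?, ?)",
    [("rep1", "ReplicateSet"), ("cal1", "qualitycontrolSet")],
)
# m1 belongs to both a replicate set and a calibration-curve set.
conn.executemany(
    "INSERT INTO MeasureSetHasMeasures VALUES (?, ?)",
    [("rep1", "m1"), ("rep1", "m2"), ("cal1", "m1")],
)

sets_for_m1 = [
    row[0]
    for row in conn.execute(
        "SELECT measureSetID FROM MeasureSetHasMeasures "
        "WHERE measureReportID = ? ORDER BY measureSetID",
        ("m1",),
    )
]
measures_in_rep1 = [
    row[0]
    for row in conn.execute(
        "SELECT measureReportID FROM MeasureSetHasMeasures "
        "WHERE measureSetID = ? ORDER BY measureReportID",
        ("rep1",),
    )
]
```

With the lookup table in place, one measure can appear in as many sets as needed, at the cost of one extra join for data users.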
I put these changes into the ERD for review and discussion
All points by @jeandavidt look good.
The first task we need to complete is how to store data that address required use cases for quality measures. What @jeandavidt suggested does address use cases that have been discussed - in particular, the main issue thread and the ability to store multiple quality measures.
The second task is to recommend easy data collection for common uses. @heather-i has a good point that Option B is easy to implement and understand for many users and use cases.
- As an interim step, we could implement Option B for version 1.2 and then have a more robust qualityFlag table for version 2.
- In practice, people could generate input templates or data collection that has option B as wide variable names.
- mixing wide and long variables works if people use a data collection table for selected measures such as PCR alleles, but the solution is not robust for multiple measures with different qualityFlagSets.
- we probably want default ODM templates that are robust and so they wouldn't include this practice.
We will need to decide whether we want the 'reportable' attribute and how that would be used. From the discussions:
- `reportable` is an important attribute that we want to keep. This attribute is widely requested, and it is also a helpful flag that there is or should be a corresponding entry in the qualitySet table.
- There is the greatest support for having it as a boolean: True = 'report'; False = 'don't report'. However, some people would prefer more categories.
- It is possible to have a more nuanced summary reportable attribute in the qualitySet table and to support a nice, clean boolean 'reportable' attribute in the measure and sample tables.
I tend to support updating the measureSetReport table to allow n:n, but we haven't received many requests that require this more robust structure. However, the more robust structure:
- is more consistent with the approach of the ODM model.
- similar to qualityFlag, we could have a flag in the measures table that says the measure is part of a measureSet. This helps users know to add entries to measureSet (if data generators) or look at the measureSet table (if data users).
- for data storage, we would remove measureSetReportID from the measures table. But similar to quality flags, data input tables could have this measure for users who have only one measureSet for any specific measure.
Regardless, the idea to add additional descriptors makes sense (e.g., qualitycontrolSet or ReplicateSet or whatever) and those descriptors could be added to the existing measureSet table.
I am not sold on MeasureSetHasMeasures. I find the 'has' tables are conceptually great, but many people are not familiar with them or their application.
Thank you @jeandavidt for laying things out so eloquently. To build on @DougManuel's point re: `reportable`: one thing that was brought up in our discussion was to continue to use `reportable` in the `measures` table and add it to the `samples` table. To add potential levels of nuance while still maintaining a final ease of interpretation, we could add a `severity` column to the proposed `qualityReports` table, with a traffic-light-style tier system. This wouldn't replace `reportable`, and would be optional, but it provides some additional detail on how important a given flag is outside the final yes-no reportable decision.
I am also somewhat conservative about blowing up the `measureSets` structure to allow for n:n relationships with measures, but if the labs are supportive of this kind of infrastructure then I think it would be great to build it in before we launch v2.0.
Version 2 will have a specific quality table that can record multiple quality measures for any sample or measure.